Abstract

Variational inference is a very efficient and popular heuristic used in various forms in the context of 
latent variable models. It’s closely related to Expectation Maximization (EM), and is applied when exact 
EM is computationally infeasible. Despite being immensely popular, current theoretical understanding 
of the effectiveness of variational inference based algorithms is very limited. In this work we provide the 
first analysis of instances where variational inference algorithms converge to the global optimum, in the 
setting of topic models. 

More specifically, we show that variational inference provably learns the optimal parameters of a topic 
model under natural assumptions on the topic-word matrix and the topic priors. The properties that the 
topic word matrix must satisfy in our setting are related to the topic expansion assumption introduced 
in (Anandkumar et al., 2013), as well as the anchor words assumption in (Arora et al., 2012b). The 
assumptions on the topic priors are related to the well known Dirichlet prior, introduced to the area of 
topic modeling by (Blei et al., 2003). 

It is well known that initialization plays a crucial role in how well variational based algorithms perform 
in practice. The initializations that we use are fairly natural. One of them is similar to what is currently 
used in LDA-c, the most popular implementation of variational inference for topic models. The other one 
is an overlapping clustering algorithm, inspired by a work by (Arora et al., 2014) on dictionary learning, 
which is very simple and efficient. 

While our primary goal is to provide insights into when variational inference might work in practice, 
the multiplicative, rather than the additive nature of the variational inference updates forces us to use 
fairly non-standard proof arguments, which we believe will be of general interest. Our proofs rely on 
viewing the updates as an operation which, at each timestep, sets the new parameter estimates to be 
noisy convex combinations of the ground truth values, and a bounded error term which depends on the 
previous estimate. The weight on the ground truth values will be large, compared to the error term, 
which will cause the error term to eventually reach zero. The large weight on the ground truth values 
will be a byproduct of our model assumptions, which will imply a “local” notion of anchor words for 
each document - words which only appear in one topic in a given document, as well as a “local” notion 
of anchor documents for each word - documents where that word appears as part of a single topic. 
1 Introduction 

Over the last few years, heuristics for non-convex optimization have emerged as one of the most fascinating 
phenomena for theoretical study in machine learning. Methods like alternating minimization, EM, variational 
inference and the like enjoy immense popularity among ML practitioners, and with good reason: they’re 
vastly more efficient than alternate available methods like convex relaxations, and are usually easily modified 
to suit different applications. 

Theoretical understanding however is sparse and we know of very few instances where these methods 
come with formal guarantees. Among more classical results in this direction are the analyses of 
Lloyd’s algorithm for K-means, which is very closely related to the EM algorithm for mixtures of 
Gaussians (Kumar and Kannan, 2010), (Dasgupta and Schulman, 2000), (Dasgupta and Schulman, 2007). The 
recent work of (Balakrishnan et al., 2014) also characterizes global convergence properties of the EM 
algorithm for more general settings. Another line of recent work has focused on a different heuristic called 
alternating minimization in the context of dictionary learning. (Agarwal et al., 2013), (Arora et al., 2015) 
prove that with appropriate initialization, alternating minimization can provably recover the ground truth. 
(Netrapalli et al., 2013) have proven similar results in the context of phase retrieval. 

Another popular heuristic which has so far eluded such attempts is known as variational 
inference (Jordan et al., 1999). We provide the first characterization of global convergence of variational inference 
based algorithms for topic models (Blei et al., 2003). We show that under natural assumptions on the 
topic-word matrix and the topic priors, along with natural initialization, variational inference converges to the 
parameters of the underlying ground truth model. To prove our result we need to overcome a number of 
technical hurdles which are unique to the nature of variational inference. Firstly, the difficulty in analyzing 
alternating minimization methods for dictionary learning is alleviated by the fact that one can come up with 
closed form expressions for the updates of the dictionary matrix. We do not have this luxury. Second, the 
“norm” in which variational inference naturally operates is KL divergence, which can be difficult to work 
with. We stress that the focus of this work is not to identify new instances of topic modeling that were 
previously not known to be efficiently solvable, but rather providing understanding about the behaviour of 
variational inference, the de facto method for learning and inference in the context of topic models. 

2 Latent variable models and EM 

We briefly review expectation-maximization (EM) and variational methods. We will be dealing with latent 
variable models, where the observations $X_i$ are generated according to a distribution 

$$P(X_i \mid \theta) = P(Z_i \mid \theta)\, P(X_i \mid Z_i, \theta)$$

where $\theta$ are the parameters of the model, and the $Z_i$ are termed hidden variables. Given the observations $X_i$, a 
common task in this context is to find the maximum likelihood value of the parameter $\theta$: 

$$\operatorname{argmax}_{\theta} \sum_i \log P(X_i \mid \theta)$$

The expectation-maximization (EM) algorithm is an iterative method to achieve this, dating all the 
way back to (Dempster et al., 1977) and (Sundberg, 1974) in the 70s. In the above framework it can be 
formulated as the following procedure, maintaining estimates $\theta^t, P^t(Z)$ of the model parameters and the 
posterior distribution over the hidden variables: 

• E-step: Compute the distribution 

$$P^t(Z) = P(Z \mid X, \theta^{t-1})$$

• M-step: Set $\theta^t$ to be 

$$\operatorname{argmax}_{\theta} \sum_i \mathbb{E}_{P^{t-1}}[\log P(X_i, Z_i \mid \theta)]$$

It’s implicitly assumed, however, that both steps can be performed efficiently. Sadly, that is not the case in 
many scenarios. A common approach then is to relax the above formulation to a tractable form. This is 
achieved by choosing an appropriate family of distributions $F$, and performing the following updates: 

• Variational E-step: Compute the distribution $P^t(Z) = \operatorname{argmin}_{P^t \in F} KL(P^t(Z) \,\|\, P(Z \mid X, \theta^{t-1}))$ 

• Variational M-step: Set $\theta^t$ to be $\operatorname{argmax}_{\theta} \sum_i \mathbb{E}_{P^{t-1}}[\log P(X_i, Z_i \mid \theta)]$ 

By picking the family $F$ appropriately, it’s often possible to make both steps above run in polynomial time. 
Neither of the two procedures above, however, usually comes with any guarantees. With EM, 
the problem is ensuring that one does not get stuck in a local optimum. With variational EM, additionally, 
we are faced with the issue of in principle not even exploring the entire space of solutions. 


3 Topic models 

We will focus on a particular latent variable model, which is very often studied - topic models (Blei et al., 
2003). The generative model here is as follows: there is a prior distribution $\alpha$ over topics. Then, each 
document is generated by the following process: 

• Sample a proportion of topics $\gamma_1, \gamma_2, \ldots, \gamma_K$ according to $\alpha$. 

• For each position in the document, pick a topic according to a multinomial distribution with parameters 
$\gamma_1, \ldots, \gamma_K$. 

• Conditioned on topic $i$ being picked at that position, pick a word $j$ from a multinomial with parameters 
$(\beta_{i,1}, \beta_{i,2}, \ldots, \beta_{i,N})$. 
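The generative process above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `sample_document`, the toy matrix `beta` and the fixed proportions `gamma` are assumptions made for the example, and the draw of the proportions from the prior is skipped (the proportions are simply supplied).

```python
import random

def sample_document(beta, gamma, n_words, rng):
    """Draw n_words word indices from a topic model.

    beta[i][j] is the probability of word j under topic i (hypothetical toy values),
    gamma[i] is the proportion of topic i in this document (supplied, not sampled).
    """
    topics = range(len(beta))
    vocab = range(len(beta[0]))
    words = []
    for _ in range(n_words):
        i = rng.choices(topics, weights=gamma)[0]    # pick a topic for this position
        j = rng.choices(vocab, weights=beta[i])[0]   # pick a word from that topic
        words.append(j)
    return words

rng = random.Random(0)
beta = [[0.5, 0.5, 0.0, 0.0],   # topic 0 only uses words 0 and 1
        [0.0, 0.0, 0.5, 0.5]]   # topic 1 only uses words 2 and 3
gamma = [1.0, 0.0]              # a document made purely of topic 0
doc = sample_document(beta, gamma, 20, rng)
```

Because the document contains only topic 0, every sampled word lies in the support of that topic.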

In this paper we will be interested in topic priors which result in sparse documents and where the correlation 
of the distributions for different topics is small. These types of properties are very commonly assumed, and 
are satisfied by the Dirichlet prior, one of the most popular priors in topic modeling (originally introduced 
by (Blei et al., 2003)). 

The body of work on topic models is vast (Blei and Lafferty, 2009). Prior theoretical work relevant in the 
context of this paper includes the sequence of works by (Arora et al., 2012b), (Arora et al., 2013), as well as 
(Anandkumar et al., 2013), (Ding et al., 2013), (Ding et al., 2014) and (Bansal et al., 2014). (Arora et al., 
2012b) and (Arora et al., 2013) assume that the topic-word matrix contains “anchor words”. This means 
that each topic has a word which appears in that topic, and no other. (Anandkumar et al., 2013) on the 
other hand work with a certain expansion assumption on the word-topic graph, which says that if one takes 
a subset $S$ of topics, the number of words in the support of these topics should be at least $|S| + s_{\max}$, where 
$s_{\max}$ is the maximum support size of any topic. Neither line of work needs any assumption on the topic priors, 
and can handle (almost) arbitrarily short documents. 

The assumptions we make on the word-topic matrix will be related to the ones in the above works, but 
our documents will need to be long, so that the empirical counts of the words are close to their expected 
counts. Our priors will also be more structured. This is expected since we are trying to analyze an existing 
heuristic rather than develop a new algorithmic strategy. The case where the documents are short seems 
significantly more difficult. Namely, in that case there are two issues to consider. One is proving the 
variational approximation to the posterior distribution over topics is not too bad. The second is proving 
that the updates do actually reach the global optimum. Assuming long documents allows us to focus on the 
second issue alone, which is already challenging. On a high level, the instances we consider will have the 
following structure: 
• The topics will satisfy a weighted expansion property: for any set $S$ of topics of constant size, for any 
topic $i$ in this set, the probability mass on words which belong to $i$, and no other topic in $S$ will be 
large. (Similar to the expansion in (Anandkumar et al., 2013), but only over constant sized subsets.) 

• The number of topics per document will be small. Further, the probability of including a given topic in 
a document is almost independent of any other topics that might be included in the document already. 
Similar properties are satisfied by the Dirichlet prior, one of the most popular priors in topic modeling. 
(Originally introduced by (Blei et al., 2003).) The documents will also have a “dominating topic”, 
similarly as in (Bansal et al., 2014). 

• For each word j, and a topic i it appears in, there will be a decent proportion of documents that 
contain topic i and no other topic containing j. These can be viewed as “local anchor documents” for 
that word-topic pair. 

We state below, informally, our main result. See Sections 6 and 7 for more details. 

Theorem. Under the above mentioned assumptions, popular variants of variational inference for topic 
models, with suitable initializations, provably recover the ground truth model in polynomial time. 


4 Variational relaxation for learning topic models 

In this section we briefly review the variational relaxation for topic models following closely the description 
in (Blei et al., 2003). Throughout the paper, we will denote by N the total number of words and K the 
number of topics. We will assume that we are working with a sample set of D documents. We will also 
denote by $f_{d,j}$ the fractional count of word $j$ in document $d$ (i.e. $f_{d,j} = \mathrm{Count}(j)/N_d$, where $\mathrm{Count}(j)$ is the 
number of times word $j$ appears in the document, and $N_d$ is the number of words in the document). 
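As a small illustration of the fractional counts, the following sketch computes them for one document given as a list of word indices. The helper name `fractional_counts` and the toy document are assumptions made for the example.

```python
from collections import Counter

def fractional_counts(words, vocab_size):
    """Return [Count(j) / N_d for every word j], where N_d = len(words)."""
    counts = Counter(words)          # Count(j) for each word index j in the document
    n_d = len(words)                 # N_d, the document length
    return [counts[j] / n_d for j in range(vocab_size)]

# A hypothetical 4-word document over a vocabulary of 4 words.
f_d = fractional_counts([0, 2, 2, 1], vocab_size=4)
```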

For topic models, variational updates are a way to approximate the computationally intractable E-step 
(Sontag and Roy, 2000) as described in Section 2. Recall the model parameters for topic models are the 
topic prior parameters $\alpha$ and the topic-word matrix $\beta$. The observable $X$ is the list of words in the document. 
The latent variables are the topic assignments $Z_j$ at each position $j$ in the document and the topic proportions 
$\gamma$. The variational E-step hence becomes $P^t(Z, \gamma) = \operatorname{argmin}_{P^t \in F} KL(P^t(Z, \gamma) \,\|\, P(Z, \gamma \mid X, \alpha^t, \beta^t))$ for some family 
$F$ of distributions. The family $F$ one usually considers is $P^t(\gamma, Z) = q(\gamma) \prod_{j=1}^{N_d} q_j(Z_j)$, i.e. a mean field 
family. In (Blei et al., 2003) it’s shown that for Dirichlet priors $\alpha$ the optimal distributions $q, q_j$ are a 
Dirichlet distribution for $q$, with some parameter $\tilde{\gamma}$, and multinomials for $q_j$, with some parameters $\phi_j$. The 
variational EM updates are shown to have the following form. 

• In the E-step, one runs to convergence the following updates on the $\phi$ and $\tilde{\gamma}$ parameters: 

$$\phi_{d,j,i} \propto \beta^t_{i,w_{d,j}} \exp\left(\mathbb{E}_q[\log(\gamma_{d,i}) \mid \tilde{\gamma}_d]\right)$$

$$\tilde{\gamma}_{d,i} = \alpha^t_{d,i} + \sum_{j=1}^{N_d} \phi_{d,j,i}$$

• In the M-step, one updates the $\beta$ and $\alpha$ parameters as follows: 

$$\beta^{t+1}_{i,j} \propto \sum_{d=1}^{D} \sum_{j'=1}^{N_d} \phi^t_{d,j',i}\, w_{d,j,j'}$$

where $\phi^t_{d,j,i}$ is the converged value of $\phi_{d,j,i}$; $w_{d,j}$ is the word in document $d$, position $j$; $w_{d,j,j'}$ is an indicator 
variable which is 1 if the word in position $j'$ in document $d$ is word $j$. 

The $\alpha$ Dirichlet parameters do not have a closed form expression and are updated via gradient descent. 
4.1 Simplified updates in the long document limit 

From the above updates it is difficult to assign an intuitive meaning to the $\tilde{\gamma}$ and $\phi$ parameters. (Indeed, 
it’s not even clear what one would like them to be ideally at the global optimum.) We will however be working 
in the large document limit - and this will simplify the updates. In particular, in the E-step, in the large 
document limit, the first term in the update equation for $\tilde{\gamma}$ has a vanishing contribution. In this case, we 
can simplify the E-update as: 

$$\phi_{d,j,i} \propto \beta^t_{i,w_{d,j}}\, \gamma_{d,i}$$

$$\gamma_{d,i} \propto \sum_{j=1}^{N_d} \phi_{d,j,i}$$

Notice, importantly, in the second update we now use variables $\gamma_{d,i}$ instead of $\tilde{\gamma}_{d,i}$, which are normalized such 
that $\sum_{i=1}^{K} \gamma_{d,i} = 1$. These correspond to the max-likelihood topic proportions, given our current estimates $\beta^t_{i,j}$ 
for the model parameters. The M-step will remain as is - but we will focus on the $\beta$ only, and ignore the $\alpha$ 
updates - as the $\alpha$ estimates disappeared from the E updates: 

$$\beta^{t+1}_{i,j} \propto \sum_{d=1}^{D} f_{d,j}\, \frac{\beta^t_{i,j}\, \gamma^t_{d,i}}{\sum_{i'=1}^{K} \gamma^t_{d,i'}\, \beta^t_{i',j}}$$

where $\gamma^t_{d,i}$ is the converged value of $\gamma_{d,i}$. In this case, the intuitive meaning of the $\beta^t$ and $\gamma^t$ variables is 
clear: they are estimates of the model parameters, and the max-likelihood topic proportions, given an 
estimate of the model parameters, respectively. 

The way we derived them, these updates appear to be an approximate form of the variational 
updates in (Blei et al., 2003). However it is possible to also view them in a more principled manner. These 
updates approximate the posterior distribution $P(Z, \gamma \mid X, \alpha^t, \beta^t)$ by first approximating this posterior by 
$P(Z \mid X, \gamma^t, \alpha^t, \beta^t)$, where $\gamma^t$ is the max-likelihood value for $\gamma$, given our current estimates of $\alpha, \beta$, and then 
setting $P(Z \mid X, \gamma^t, \alpha^t, \beta^t)$ to be a product distribution. 

It is intuitively clear that in the large document limit, this approximation should not be much worse than 
the one in (Blei et al., 2003), as the posterior concentrates around the maximum likelihood value. (And in 
fact, our proofs will work for finite, but long documents.) Finally, we will rewrite the above equations in 
a slightly more convenient form. Denoting $\tilde{f}_{d,j} = \sum_{i=1}^{K} \gamma_{d,i}\, \beta^t_{i,j}$, the E-step can be written as: iterate until 
convergence 

$$\gamma_{d,i} = \gamma_{d,i} \sum_{j=1}^{N} \frac{f_{d,j}}{\tilde{f}_{d,j}}\, \beta^t_{i,j}$$

The M-step becomes: 

$$\beta^{t+1}_{i,j} = \beta^t_{i,j} \cdot \frac{\sum_{d=1}^{D} \frac{f_{d,j}}{\tilde{f}^t_{d,j}}\, \gamma^t_{d,i}}{\sum_{d=1}^{D} \gamma^t_{d,i}}$$

where $\tilde{f}^t_{d,j} = \sum_{i=1}^{K} \gamma^t_{d,i}\, \beta^t_{i,j}$ and $\gamma^t_{d,i}$ is the converged value of $\gamma_{d,i}$. 


4.2 Alternating KL minimization and thresholded updates 

We will further modify the E and M-step update equations we derived above. In a slightly modified form, 
these updates were used in a paper by (Lee and Seung, 2000) in the context of non-negative matrix 
factorization. There the authors proved that under these updates $\sum_{d=1}^{D} KL(f_d \,\|\, \tilde{f}_d)$ is non-increasing. One can 
easily modify their arguments to show that the same property is preserved if the E-step is replaced by a step 
$\gamma_d = \operatorname{argmin}_{\gamma_d \in \Delta_K} KL(f_d \,\|\, \tilde{f}_d)$, where $\Delta_K$ is the $K$-dimensional simplex - i.e. minimizing the KL divergence 
between the counts and the “predicted counts” with respect to the $\gamma$ variables. (In fact, iterating the $\gamma$ 
updates above is a way to solve this convex minimization problem via a version of gradient descent which 
makes multiplicative updates, rather than additive updates.) 
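The multiplicative update on the topic proportions can be tried out numerically. The sketch below is a minimal illustration under assumed toy values (the matrix `beta`, the true proportions 0.7 and 0.3, and the helper names `predicted`, `kl`, `gamma_update` are all hypothetical): it holds the topic-word matrix fixed, iterates the update for one noiseless document, and records the KL divergence between the counts and the predicted counts at every step.

```python
import math

def predicted(gamma_d, beta):
    """Predicted counts: sum over topics i of gamma_d[i] * beta[i][j], for each word j."""
    return [sum(g * row[j] for g, row in zip(gamma_d, beta))
            for j in range(len(beta[0]))]

def kl(f, f_tilde):
    """KL divergence between counts f and predicted counts f_tilde."""
    return sum(p * math.log(p / q) for p, q in zip(f, f_tilde) if p > 0)

def gamma_update(f_d, beta, gamma_d):
    """One multiplicative step: gamma_i <- gamma_i * sum_j (f_j / f_tilde_j) * beta[i][j]."""
    f_tilde = predicted(gamma_d, beta)
    return [g * sum(f_d[j] / f_tilde[j] * row[j]
                    for j in range(len(f_d)) if f_d[j] > 0)
            for g, row in zip(gamma_d, beta)]

beta = [[0.6, 0.3, 0.1, 0.0],
        [0.0, 0.2, 0.3, 0.5]]
f_d = predicted([0.7, 0.3], beta)   # noiseless counts of an idealized long document
gamma_d = [0.5, 0.5]                # uniform start over the two topics
history = []
for _ in range(50):
    history.append(kl(f_d, predicted(gamma_d, beta)))
    gamma_d = gamma_update(f_d, beta, gamma_d)
```

In this toy instance the recorded divergences never increase, and the iterates approach the proportions that generated the counts.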

Thus the updates are performing alternating minimization using the KL divergence as the distance 
measure (with the difference that for the $\beta$ variables one essentially just performs a single gradient step). 
In this paper, we will make a modification of the M-step which is very natural. Intuitively, the update for 
$\beta^t_{i,j}$ goes over all appearances of the word $j$ and adds the “fractional assignment” of the word $j$ to topic $i$ 
under our current estimates of the variables $\beta, \gamma$. In the modified version we will only average over the set $D_i$ of those 
documents $d$ where $\gamma^t_{d,i} > \gamma^t_{d,i'}, \forall i' \neq i$. 

The intuitive reason behind this modification is the following. The EM updates we are studying work 
with the KL divergence, which puts more weight on the larger entries. Thus, for the documents in $D_i$, the 
estimates for $\gamma^t_{d,i}$ should be better than they might be in the documents $D \setminus D_i$. (Of course, since the terms 
$\tilde{f}^t_{d,j}$ involve all the variables $\gamma^t_{d,i}$, it is not a priori clear that this modification will gain us much, but we will 
prove that it in fact does.) Formally, we discuss the following three modifications of variational inference 
(we call them tEM, for thresholded EM): 


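Algorithm 1 KL-tEM 

• (E-step) Solve the following convex program for each document $d$: 

$$\min_{\gamma^t_{d,i}} \sum_{j} f_{d,j} \log\left(\frac{f_{d,j}}{\tilde{f}^t_{d,j}}\right)$$

s.t. $\gamma^t_{d,i} \geq 0$, $\sum_i \gamma^t_{d,i} = 1$ and $\gamma^t_{d,i} = 0$ if $i$ does not belong to document $d$ 

• (M-step) Let $D_i$ be the set of documents $d$, s.t. $\gamma^t_{d,i} > \gamma^t_{d,i'}, \forall i' \neq i$. 

Set 

$$\beta^{t+1}_{i,j} = \beta^t_{i,j} \cdot \frac{\sum_{d \in D_i} \frac{f_{d,j}}{\tilde{f}^t_{d,j}}\, \gamma^t_{d,i}}{\sum_{d \in D_i} \gamma^t_{d,i}}$$


Algorithm 2 Iterative tEM 

• (E-step) Initialize $\gamma^t_{d,i}$ uniformly among the topics in the support of document $d$. 
Repeat 

$$\gamma^t_{d,i} = \gamma^t_{d,i} \sum_{j=1}^{N} \frac{f_{d,j}}{\tilde{f}^t_{d,j}}\, \beta^t_{i,j} \qquad (4.1)$$

until convergence. 

• (M-step) Same as above. 


Algorithm 3 Incomplete tEM 

• (E-step) Initialize $\gamma^t_{d,i}$ with the values obtained in the previous iteration, and perform just one step of (4.1). 

• (M-step) Same as before. 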
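The thresholded M-step shared by the three algorithms can be sketched as follows. This is a hedged illustration, not the authors' implementation: the helper names are hypothetical, the toy instance has disjoint topic supports, the proportions passed in are the true ones rather than the output of an E-step, and keeping the old row when no document is dominated by a topic is an assumption added only to avoid a division by zero.

```python
def predicted(gamma_d, beta):
    """Predicted counts: sum over topics i of gamma_d[i] * beta[i][j], for each word j."""
    return [sum(g * row[j] for g, row in zip(gamma_d, beta))
            for j in range(len(beta[0]))]

def dominant_docs(Gamma, i):
    """D_i: documents in which topic i is strictly larger than every other topic."""
    return [d for d, g in enumerate(Gamma)
            if all(g[i] > g[k] for k in range(len(g)) if k != i)]

def thresholded_m_step(F, beta, Gamma):
    """beta[i][j] <- beta[i][j] * (sum over D_i of f/f_tilde * gamma) / (sum over D_i of gamma)."""
    F_tilde = [predicted(g, beta) for g in Gamma]
    new_beta = []
    for i, row in enumerate(beta):
        D_i = dominant_docs(Gamma, i)
        weight = sum(Gamma[d][i] for d in D_i)
        if weight == 0:                      # assumption: leave the row unchanged if D_i is empty
            new_beta.append(list(row))
            continue
        new_row = []
        for j, b in enumerate(row):
            num = sum(F[d][j] / F_tilde[d][j] * Gamma[d][i]
                      for d in D_i if F[d][j] > 0)
            new_row.append(b * num / weight)
        new_beta.append(new_row)
    return new_beta

beta_true = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
Gamma = [[0.8, 0.2], [0.3, 0.7]]                      # topic 0 dominates doc 0, topic 1 doc 1
F = [predicted(g, beta_true) for g in Gamma]          # noiseless fractional counts
beta_est = [[0.6, 0.4, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]   # correct support, wrong values
beta_new = thresholded_m_step(F, beta_est, Gamma)
```

In this deliberately easy instance a single thresholded update moves the estimate back onto the generating matrix; the general case is what the analysis in the paper addresses.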


5 Initializations 

We will consider two different strategies for initialization. 

First, we will consider the case where we initialize with the topic-word matrix, and the document priors 
having the correct support. The analysis of tEM in this case will be the cleanest. While the main focus of 
the paper is tEM, we’ll show that this initialization can actually be done for our case efficiently. 

Second, we will consider an initialization that is inspired by what the current LDA-c implementation 
uses. Concretely, we’ll assume that the user has some way of finding, for each topic $i$, a seed document in 
which the proportion of topic $i$ is at least $C_l$. Then, when initializing, one treats this document as if it were 
pure: namely one sets $\beta^0_{i,j}$ to be the fractional count of word $j$ in this document. We do not attempt to 
design an algorithm to find these documents. 
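A minimal sketch of the seeded initialization follows, assuming the seed documents are already known. The function name `seeded_init`, the toy matrices and the choice of seeds are hypothetical. The example also checks a simple consequence in the noiseless case: if topic $i$ has proportion at least $C_l$ in its seed document, every initial entry for that topic is at least $C_l$ times the true entry.

```python
def predicted(gamma_d, beta):
    """Noiseless fractional counts: sum over topics i of gamma_d[i] * beta[i][j]."""
    return [sum(g * row[j] for g, row in zip(gamma_d, beta))
            for j in range(len(beta[0]))]

def seeded_init(F, seed_docs):
    """Row i of the initial topic-word matrix is the fractional count vector of seed_docs[i]."""
    return [list(F[d]) for d in seed_docs]

beta_true = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
Gamma = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
F = [predicted(g, beta_true) for g in Gamma]
C_l = 0.8
beta0 = seeded_init(F, seed_docs=[0, 1])   # doc 0 seeds topic 0, doc 1 seeds topic 1
```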

6 Case study 1: Sparse topic priors, support initialization 

We start with a simple case. As mentioned, all of our results only hold in the long documents regime: we 
will assume for each document $d$, the number of sampled words is large enough, so that one can approximate 
the expected frequencies of the words, i.e., one can find values $\gamma^*_{d,i}$ such that $f_{d,j} = (1 \pm \epsilon) \sum_{i=1}^{K} \gamma^*_{d,i}\, \beta^*_{i,j}$. 
We’ll split the rest of the assumptions into those that apply to the topic-word matrix, and the topic priors. 
Let’s first consider the assumptions on the topic-word matrix. We will impose conditions that ensure the 
topics don’t overlap too much. Namely, we assume: 

• Words are discriminative: Each word appears in $o(K)$ topics. 

• Almost disjoint supports: if the intersection of the supports of $i$ and $i'$ is $S$, $\sum_{j \in S} \beta^*_{i,j} \leq o(1) \cdot \sum_{j} \beta^*_{i,j}$. 

We also need assumptions on the topic priors. The documents will be sparse, and all topics will be 
roughly equally likely to appear. There will be virtually no dependence between the topics: conditioning on 
the size or presence of a certain topic will not influence much the probability of another topic being included. 
These are analogues of distributions that have been analyzed for dictionary learning (Arora et al., 2015). 
Formally: 

• Sparse and gapped documents: Each of the documents in our samples has at most $T = O(1)$ topics. 
Furthermore, for each document $d$, the largest topic $i_0 = \operatorname{argmax}_i \gamma^*_{d,i}$ is such that for any other topic 
$i'$, $\gamma^*_{d,i_0} - \gamma^*_{d,i'} > \rho$ for some (arbitrarily small) constant $\rho$. 

• Dominant topic equidistribution: The probability that topic $i$ is such that $\gamma^*_{d,i} > \gamma^*_{d,i'}, \forall i' \neq i$ is $\Theta(1/K)$. 

• Weak topic correlations and independent topic inclusion: For all sets $S$ with $o(K)$ topics, it must 
be the case that: $\mathbb{E}[\gamma^*_{d,i} \mid \gamma^*_{d,i} \text{ is dominating}] = (1 \pm o(1))\, \mathbb{E}[\gamma^*_{d,i} \mid \gamma^*_{d,i} \text{ is dominating}, \gamma^*_{d,i'} = 0, i' \in S]$. 
Furthermore, for any set $S$ of topics, s.t. $|S| \leq T - 1$, $\Pr[\gamma^*_{d,i} > 0 \mid \gamma^*_{d,i'}, i' \in S] = \Theta(1/K)$. 
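The sparsity and gap condition on a single document is easy to state in code. The sketch below is illustrative only: the function name, the strict inequality for the gap, and the example proportion vectors are assumptions made for the example.

```python
def is_sparse_and_gapped(gamma_d, max_topics, rho):
    """True if at most max_topics topics are present and the largest one
    exceeds the second largest by more than rho (assumed strict)."""
    present = sorted((g for g in gamma_d if g > 0), reverse=True)
    if not present or len(present) > max_topics:
        return False
    return len(present) == 1 or present[0] - present[1] > rho

ok_doc = is_sparse_and_gapped([0.6, 0.3, 0.1, 0.0], max_topics=3, rho=0.1)
tied_doc = is_sparse_and_gapped([0.45, 0.45, 0.1, 0.0], max_topics=3, rho=0.1)
dense_doc = is_sparse_and_gapped([0.4, 0.3, 0.2, 0.1], max_topics=3, rho=0.05)
```

The first document passes; the second has no dominating topic and the third contains too many topics.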

These assumptions are a less smooth version of properties of the Dirichlet prior. Namely, it’s a folklore 
result that Dirichlet draws are sparse with high probability, for a certain reasonable range of parameters. This 
was formally proven by (Telgarsky, 2015) - though sparsity there means a small number of large coordinates. 
It’s also well known that Dirichlet essentially cannot enforce any correlation between different topics.¹ 

The above assumptions can be viewed as a local notion of separability of the model, in the following sense. 
First, consider a particular document $d$. For each topic $i$ that participates in that document, consider the 
words $j$ which only appear in the support of topic $i$ in the document. In some sense, these words are local 
anchor words for that document: these words appear only in one topic of that document. Because of the 
“almost disjoint supports” property, there will be a decent mass on these words in each document. Similarly, 
consider a particular non-zero element $\beta^*_{i,j}$ of the topic-word matrix. Let’s call $D_l$ the set of documents 
where $\beta^*_{i',j} = 0$ for all other topics $i' \neq i$ appearing in that document. These documents are like local anchor 
documents for that word-topic pair: in those documents, the word appears as part of only topic $i$. It turns 
out the above properties imply there is a decent number of these for any word-topic pair. 
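The two local notions can be made concrete with a short sketch. The helper names and the toy topic-word matrix below are assumptions made for the example; documents are represented only by the set of topics they contain.

```python
def local_anchor_words(beta, doc_topics, i):
    """Words in the support of topic i and of no other topic present in the document."""
    return [j for j, b in enumerate(beta[i])
            if b > 0 and all(beta[k][j] == 0 for k in doc_topics if k != i)]

def local_anchor_docs(beta, docs_topics, i, j):
    """Documents containing topic i in which no other present topic contains word j."""
    return [d for d, topics in enumerate(docs_topics)
            if i in topics and all(beta[k][j] == 0 for k in topics if k != i)]

beta = [[0.5, 0.3, 0.2, 0.0, 0.0],   # topic 0
        [0.0, 0.4, 0.0, 0.6, 0.0],   # topic 1 shares word 1 with topic 0
        [0.0, 0.0, 0.3, 0.0, 0.7]]   # topic 2 shares word 2 with topic 0

# In a document with topics {0, 1}, word 2 is a local anchor for topic 0
# even though it is not a global anchor word (topic 2 also uses it).
words = local_anchor_words(beta, doc_topics={0, 1}, i=0)

# For the pair (topic 0, word 1), only the document with topics {0, 2} qualifies.
docs = local_anchor_docs(beta, [{0, 1}, {0, 2}, {1, 2}], i=0, j=1)
```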

Finally, a technical condition: we will also assume that all nonzero $\gamma^*_{d,i}, \beta^*_{i,j}$ are at least $1/\mathrm{poly}(N)$. Intuitively, 
this means if a topic is present, it needs to be reasonably large, and similarly for words in topics. Such 
assumptions also appear in the context of dictionary learning (Arora et al., 2015). 

We will prove the following 

Theorem 1. Given an instance of topic modelling satisfying the properties specified above, where the 
number of documents is $\Omega($[…]$)$, if we initialize the supports of the $\beta^t_{i,j}$ and $\gamma^t_{d,i}$ variables correctly, after 
$O(\log(1/\epsilon') + \log N)$ KL-tEM, iterative-tEM updates or incomplete-tEM updates, we recover the topic-word 
matrix and topic proportions to multiplicative accuracy $1 + \epsilon'$, for any $\epsilon'$ s.t. $1 + \epsilon' <$ […]. 

¹We show analogues of the weak topic correlations property and equidistribution in the supplementary material for 
completeness’ sake. 

Theorem 2. If the number of documents is $\Omega($[…] $\log^{[\ldots]} K)$, there is a polynomial-time procedure which with 
probability $1 -$ […] correctly identifies the supports of the $\beta^*_{i,j}$ and $\gamma^*_{d,i}$ variables. 

Provable convergence of tEM: The correctness of the tEM updates is proven in 3 steps: 

• Identifying dominating topic: First, we prove that if $\gamma^t_{d,i}$ is the largest one among all topics in the 
document, topic $i$ is actually the largest topic. 

• Phase I: Getting constant multiplicative factor estimates: After initialization, after $O(\log N)$ rounds, 
we will get to variables $\beta^t_{i,j}, \gamma^t_{d,i}$ which are within a constant multiplicative factor from $\beta^*_{i,j}, \gamma^*_{d,i}$. 

• Phase II (Alternating minimization - lower and upper bound evolution): Once the $\beta$ and $\gamma$ estimates 
are within a constant factor of their true values, we show that the lone words and documents have a 
boosting effect: they cause the multiplicative upper and lower bounds to improve at each round. 

The updates we are studying are multiplicative, not additive in nature, and the objective they are 
optimizing is non-convex, so the standard techniques do not work. The intuition behind our proof in Phase 
II can be described as follows. Consider one update for one of the variables, say $\beta^t_{i,j}$. We show that 
$\beta^{t+1}_{i,j} \approx \alpha \beta^*_{i,j} + (1 - \alpha) C^t \beta^*_{i,j}$ for some constant $C^t$ at time step $t$. $\alpha$ is something fairly large (one should 
think of it as $1 - o(1)$), and comes from the existence of the local anchor documents. A similar equation 
holds for the $\gamma$ variables, in which case the “good” term comes from the local anchor words. Furthermore, 
we show that the error in the $\approx$ decreases over time, as does the value of $C^t$, so that eventually we can reach 
$\beta^*_{i,j}$. The analysis bears a resemblance to the state evolution and density evolution methods in error decoding 
algorithm analysis - in the sense that we maintain a quantity about the evolving system, and analyze how 
it evolves under the specified iterations. The quantities we maintain are quite simple - upper and lower 
multiplicative bounds on our estimates at any round $t$. 
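To see why such a recursion forces geometric convergence, consider the following idealized calculation. It is a sketch under simplifying assumptions that are not part of the paper's statement: the approximation is treated as an equality, $\alpha$ is held fixed across rounds, and $C^{t+1}$ is taken to be exactly the multiplicative factor reached after the update.

```latex
% Idealized version of the Phase II recursion (assumptions: exact equality, fixed alpha).
\begin{align*}
\beta^{t+1}_{i,j} &= \alpha\,\beta^*_{i,j} + (1-\alpha)\,C^t\,\beta^*_{i,j}
  = \bigl(\alpha + (1-\alpha)\,C^t\bigr)\,\beta^*_{i,j}, \\
C^{t+1} &:= \alpha + (1-\alpha)\,C^t
  \quad\Longrightarrow\quad
  C^{t+1} - 1 = (1-\alpha)\,(C^t - 1), \\
C^{t} - 1 &= (1-\alpha)^{t}\,(C^0 - 1).
\end{align*}
```

Under these assumptions the multiplicative error shrinks by a factor $1 - \alpha$ per round, so a constant initial factor $C^0$ reaches $1 + \epsilon'$ after a number of rounds logarithmic in $1/\epsilon'$, which matches the form of the iteration counts in the theorems.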

Initialization: Recall the goal of this phase is to recover the supports - i.e. to find out which topics are 
present in a document, and identify the support of each topic. We will find the topic supports first. This 
uses an idea inspired by (Arora et al., 2014) in the setting of dictionary learning. Roughly, we devise a test, 
which will take as input two documents $d, d'$ and will try to determine if the two documents have a topic in 
common or not. The test will have no false positives, i.e., will never say YES, if the documents don’t have 
a topic in common, but might say NO even if they do. We then ensure that with high probability, for each 
topic we find a pair of documents intersecting in that topic, such that the test says YES.² 

7 Case study 2: Dominating topics, seeded initialization 

Next, we’ll consider an initialization which is essentially what the current implementation of LDA-c uses. 
Namely, we will call the following initialization a seeded initialization: 

• For each topic $i$, the user supplies a document $d$, in which $\gamma^*_{d,i} \geq C_l$. 

• We treat the document as if it only contains topic $i$ and initialize with $\beta^0_{i,j} = f_{d,j}$. 

We show how to modify the previous analysis to show that with a few more assumptions, this strategy 
works as well. Firstly, we will have to assume anchor words, that make up a decent fraction of the mass of 
each topic. Second, we also assume that the words have a bounded dynamic range, i.e. the values of a word 
in two different topics are within a constant factor $B$ of each other. The documents are still gapped, but the 
gap now must be larger. Finally, in roughly $1/B$ fraction of the documents where topic $i$ is dominant, that 
topic has proportion $1 - \delta$, for some small (but still constant) $\delta$. A similar assumption (a small fraction of 
almost pure documents) appeared in a recent paper by (Bansal et al., 2014). Formally, we have: 

• Small dynamic range and large fraction of anchors: For each discriminative word $j$, if $\beta^*_{i,j} \neq 0$ and 
$\beta^*_{i',j} \neq 0$, $\beta^*_{i,j} \leq B \beta^*_{i',j}$. Furthermore, each topic $i$ has anchor words, such that their total weight is at 
least $p$. 

²The detailed initialization algorithm is included in the supplementary material. 
• Gapped documents: In each document, the largest topic has proportion at least $C_l$, and all the other 
topics are at most $C_s$, s.t. 

$$C_l - C_s \geq \frac{1}{p}\left(\sqrt{2\left(p \log\left(\frac{1}{C_l}\right) + (1 - p) \log(B C_l)\right)} + \sqrt{\log(1 + \epsilon)}\right) + \epsilon$$

• Small fraction of $1 - \delta$ dominant documents: Among all the documents where topic $i$ is dominating, 
in a $8/B$ fraction of them, $\gamma^*_{d,i} \geq 1 - \delta$, where 

$$\delta := \min\left( [\ldots] - \frac{1}{p}\left(\sqrt{2\left(p \log\left(\frac{1}{C_l}\right) + (1 - p) \log(B C_l)\right)} + \sqrt{\log(1 + \epsilon)}\right) - \epsilon,\; 1 - [\ldots] \right)$$

The dependency between the parameters $B, p, C_l$ is a little difficult to parse, but if one thinks of $C_l$ as 
$1 - \eta$ for $\eta$ small, and $p \geq 1 - \eta/\log B$, since $\log(1/C_l) \approx \eta$, roughly we want that $C_l - C_s \gg \frac{2}{p}\sqrt{\eta}$. (In other 
words, the weight we require to have on the anchors depends only logarithmically on the range $B$.) In the 
documents where the dominant topic has proportion $1 - \delta$, a similar reasoning as above gives that we want 
$\gamma^*_{d,i}$ to be approximately at least $1 -$ […]. The precise statement is as follows: 


Theorem 3. Given an instance of topic modelling satisfying the properties specified above, where the number 
of documents is $\Omega($[…]$)$, if we initialize with seeded initialization, after $O(\log(1/\epsilon') + \log N)$ KL-tEM 
updates, we recover the topic-word matrix and topic proportions to multiplicative accuracy $1 + \epsilon'$, if 
$1 + \epsilon' >$ […]. 


The proof is carried out in a few phases: 


• Phase I: Anchor identification: We show that as long as we can identify the dominating topic in each 
of the documents, anchor words will make progress: after $O(\log N)$ rounds, the values for 
the topic-word estimates will be almost zero for the topics for which word $w$ is not an anchor. For the 
topic for which a word is an anchor we’ll have a good estimate. 

• Phase II: Discriminative word identification: After the anchor words are properly identified in the 
previous phase, if $\beta^*_{i,j} = 0$, $\beta^t_{i,j}$ will keep dropping and quickly reach almost zero. The values corresponding 
to $\beta^*_{i,j} \neq 0$ will be decently estimated. 

• Phase III: Alternating minimization: After Phase I and II above, we are back to the scenario of the 
previous section: namely, there is improvement in each next round. 

During Phase I and II the intuition is the following: due to our initialization, even in the beginning, each 
topic is “correlated” with the correct values. In a $\gamma$ update, we are minimizing $KL(f_d \,\|\, \tilde{f}_d)$ with respect to 
the $\gamma_d$ variables, so we need a way to argue that whenever the $\beta$ estimates are not too bad, minimizing this 
quantity provides an estimate about how far the optimal $\gamma_d$ variables are from $\gamma^*_d$. We show the following 
useful claim: 


Lemma 4. If, for all topics $i$, $KL(\beta^*_i \,\|\, \beta^t_i) \leq R_\beta$, and $\min_{\gamma_d \in \Delta_K} KL(f_d \,\|\, \tilde{f}_d) \leq R_f$, after running a KL 
divergence minimization step with respect to the $\gamma_d$ variables, we get that $\|\gamma^*_d - \gamma_d\|_1 \leq$ […]. 

This lemma critically uses the existence of anchor words - namely we show $\|\beta^* v\|_1 \geq p \|v\|_1$. Intuitively, 
if one thinks of $v$ as $\gamma^* - \gamma^t$, $\|\beta^* v\|_1$ will be large if $\|v\|_1$ is large. Hence, if $\|\beta^* - \beta^t\|_1$ is not too large, 
whenever $\|f^* - f^t\|_1$ is small, so is $\|\gamma^* - \gamma^t\|_1$. We will be able to maintain $R_\beta$ and $R_f$ small enough 
throughout the iterations, so that we can identify the largest topic in each of the documents. 
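The anchor-word inequality used in the lemma can be checked numerically on a toy matrix. In the sketch below (helper names and values are assumptions for illustration), each topic has one anchor word carrying mass $p = 0.4$; since an anchor word of topic $i$ receives weight only from coordinate $i$ of $v$, the $\ell_1$ norm of the mixed vector is at least $p$ times the $\ell_1$ norm of $v$.

```python
import random

def mix(beta, v):
    """Entry j of the result is sum over topics i of v[i] * beta[i][j]."""
    return [sum(v[i] * beta[i][j] for i in range(len(beta)))
            for j in range(len(beta[0]))]

def l1(x):
    return sum(abs(t) for t in x)

beta = [[0.4, 0.0, 0.3, 0.3],   # word 0 is an anchor for topic 0 (mass 0.4)
        [0.0, 0.4, 0.3, 0.3]]   # word 1 is an anchor for topic 1 (mass 0.4)
p = 0.4

rng = random.Random(1)
holds = True
for _ in range(1000):
    v = [rng.uniform(-1.0, 1.0) for _ in beta]          # arbitrary signed vector over topics
    holds = holds and l1(mix(beta, v)) >= p * l1(v) - 1e-12

# The bound is tight when the non-anchor words cancel, e.g. v = (1, -1).
tight = l1(mix(beta, [1.0, -1.0]))
```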
8 On common words 


We briefly remark on common words: words such that $\beta^*_{i,j} \leq \kappa \beta^*_{i',j}$, $\kappa \leq B$. In this case, the proofs 
above, as they are, will not work,³ since common words do not have any lone documents. However, if 
$1 -$ […] fraction of the documents where topic $i$ is dominant contains topic $i$ with proportion $1 -$ […] and 
furthermore, in each topic, the weight on these words is no more than […], then our proofs still work with 
either initialization. The idea for the argument is simple: when the dominating topic is very large, we show 
that $f^*_{d,j} / f^t_{d,j}$ is very highly correlated with $\beta^*_{i,j} / \beta^t_{i,j}$, so these documents behave like anchor documents. Namely, one 
can show: 

Theorem 5. If we additionally have common words satisfying the properties specified above, after $O(\log(1/\epsilon') + \log N)$ 
KL-tEM updates in Case Study 2, or any of the tEM variants in Case Study 1, and we use the same 
initializations as before, we recover the topic-word matrix and topic proportions to multiplicative accuracy 
$1 + \epsilon'$, if $1 + \epsilon' >$ […]. 

9 Discussion and open problems 

In this work we provide the first characterization of sufficient conditions when variational inference leads to 
optimal parameter estimates for topic models. Our proofs also suggest possible hard cases for variational 
inference, namely instances with large dynamic range compared to the proportion of anchor words and/or 
correlated topic priors. It’s not hard to hand-craft such instances where support initialization performs very 
badly, even with only anchor and common words. We made no effort to explore the optimal relationship 
between the dynamic range and the proportion of anchor words, as it’s not clear what are the “worst case” 
instances for this trade-off. 

Seeded initialization, on the other hand, empirically works much better. We found that when Ci > 0.6, 
and when the proportion of anchor words is as low as 0.2, variational inference recovers the ground truth, 
even on instances with fairly large dynamic range. Our current proof methods are too weak to capture this 
observation. (In fact, even the largest topic is sometimes misidentified in the initial stages, so one cannot even 
run tEM, only the vanilla variational inference updates.) Analyzing the dynamics of variational inference in 
this regime seems like a challenging problem which would require significantly new ideas. 


Acknowledgements 

We would like to thank Sanjeev Arora for helpful discussions in various stages of this work. 

^We stress we want to analyze whether variational inference will work or not. Handling common words algorithmically is 
easy: they can be detected and ’’filtered out” initially. Then we can perform the variational inference updates over the rest of 
the words only. This is in fact often done in practice. 

“^See supplementary material. 
Supplementary material 


A Notation throughout supplementary material 

We will use <, > to denote that the corresponding (in)equality holds up to constants. We will use to 
denote equivalence. We will say that an event happens with high probability, if it happens with probability 
1 — or 1 — ^ for some constant c. 

B Case study 1: Sparse topic priors, support initialization 

B.1 Provable convergence of tEM 

As a reminder, the theorem we want to prove is: 

Theorem 1. Given an instance of topic modelling satisfying the Case Study 1 properties specified above, 
where the number of documents is ^ ), if we initialize the supports of the Pi^ and 7 ^^ variables 

correctly, after $O(\log(1/\epsilon') + \log N)$ KL-tEM, iterative-tEM updates or incomplete-tEM updates, we recover 
the topic-word matrix and topic proportions to multiplicative accuracy 1 -I- e', for any e' s.t. 1 -|- e' < 

The general outline of the proof will be the following. 

• Identifying dominating topic: For the modified tEM updates, we need to make sure that the topic with 
maximal $\gamma^t_{d,i}$ is the dominant one. 

• Phase I: Getting constant multiplicative factor estimates: First, we’ll show that after initialization, after 
$O(\log N)$ rounds, we will get to variables $\beta^t_{i,j}$, $\gamma^t_{d,i}$ which are within a constant multiplicative 
factor from $\beta^*_{i,j}$, $\gamma^*_{d,i}$. 

— Lower bounds on the $\beta$ and $\gamma$ variables: We’ll show that determining the supports of the documents 
and the topic-word matrix, as well as being able to identify the documents in which topic 
$i$ is large, is enough to ensure that all the $\beta^t_{i,j}$ and $\gamma^t_{d,i}$ variables are lower bounded by $\frac{1}{C^0_\beta}\beta^*_{i,j}$ and 
$\frac{1}{C^0_\gamma}\gamma^*_{d,i}$ respectively, for some constants $C^0_\beta > 1$, $C^0_\gamma > 1$. 

— Improving upper bounds on the $\beta^t_{i,j}$ values: We show that, if the above two properties are satisfied, 
we can get a multiplicative upper bound on the $\beta^t_{i,j}$ values, which strictly improves at each step 
until it reaches a constant. This improvement is very fast: we only need a logarithmic number of 
steps. After this happens, we show that the $\gamma$ variables corresponding to these $\beta$ estimates must 
be within a constant of the ground truth as well. 

• Phase II (Alternating minimization - lower and upper bound evolution): Once the $\beta$ and $\gamma$ estimates 
are within a constant factor of their true values, we show that the lone words and documents have a 
boosting effect: they cause the multiplicative upper and lower bounds to improve at each round. 

A word about incorporating the “correct supports” assumption in our algorithms. For the $\beta$ variables 
this is obvious: we just set $\beta^t_{i,j} = 0$ if $\beta^*_{i,j} = 0$. For the $\gamma$ variables it’s also fairly straightforward. In KL-tEM 
we mean simply that in the convex program above, we constrain $\gamma^t_{d,i} = 0$ if $\gamma^*_{d,i} = 0$. 

In the iterative version, this just means that before starting the $\gamma$ iterations, we set the initial value to 0 
if $\gamma^*_{d,i} = 0$, and uniform among the rest of the variables. The same holds for the incomplete version. 

In the interest of brevity, whenever we say “the supports are correct”, the above is what we will mean. 
Recall, we use $t$ to count the iterations for the $\beta$ variables. Put another way, $\gamma^t_{d,i}$ is the value we get for $\gamma_{d,i}$ 
after the $\beta$ variables were updated to $\beta^t_{i,j}$. (Which, of course, implies that $\beta^{t+1}_{i,j}$ will be the values we get for the 
$\beta$ variables after the $\gamma$ variables are updated to $\gamma^t_{d,i}$.) 

The proofs for each of the variants of tEM are similar. To start, we show everything for KL-tEM, 
and then just mention how to modify the arguments to get the results for the other variants in section B.2. 
B.1.1 Determining largest topic 

First, we show that the “thresholding” operation works. Namely, we show that if $\gamma^t_{d,i} > \gamma^t_{d,i'}, \forall i' \neq i$, then 
$\gamma^*_{d,i}$ is the largest topic in the document (there is a unique one by the “slightly gapped documents” property). 
Furthermore, we can say that $\frac{1}{2}\gamma^*_{d,i} \le \gamma^t_{d,i} \le 2\gamma^*_{d,i}$. 

Lemma 6. Fix a document $d$. Let the supports of the $\gamma$ and $\beta$ variables be correct. Then, after a $\gamma$ iteration, 
if $\gamma^t_{d,i} > \gamma^t_{d,i'}, \forall i' \neq i$, then $\gamma^*_{d,i}$ is the largest topic in the document. Furthermore, $\frac{1}{2}\gamma^*_{d,i} \le \gamma^t_{d,i} \le 2\gamma^*_{d,i}$. 

Proof. Since there are a constant number of topics in the document, the largest topic has proportion $\Omega(1)$. 
Consider the KL-tEM convex optimization problem. The KKT conditions are easily seen to imply: 

$$\sum_{j=1}^{N} \frac{\tilde f_{d,j}}{f^t_{d,j}}\beta^t_{i,j} = 1 \qquad \text{(B.1)}$$

For each topic $i$, since we are considering a constrained optimization problem, it has to be the case that 
it either satisfies B.1, or $\gamma^t_{d,i} = 0$, or $\gamma^t_{d,i} = 1$. 

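Let’s assume first that $i$ satisfies B.1. Then, 

$$\gamma^t_{d,i} = \sum_{j=1}^{N} \frac{\tilde f_{d,j}}{f^t_{d,j}}\gamma^t_{d,i}\beta^t_{i,j}$$

Let’s call the words $j$ which only appear in the support of topic $i$ in the document lone for that topic, 
and let’s denote that set as $L_i$. 

If $L_i$ are the lone words for topic $i$, the contribution of the non-lone words is $T \cdot o(1) = o(1)$, so 

$$\gamma^t_{d,i} \le \sum_{j \in L_i} (1+\epsilon)\beta^*_{i,j}\gamma^*_{d,i} + o(1) \le (1+\epsilon)\gamma^*_{d,i} + o(1) \le \gamma^*_{d,i} + o(1)$$

On the other hand, $\gamma^t_{d,i} \ge (1-\epsilon)(1-o(1))\gamma^*_{d,i} \ge (1-o(1))\gamma^*_{d,i}$, so $\gamma^t_{d,i} \ge \gamma^*_{d,i} - o(1)$. 

Since there is a constant gap of $p$ between the largest topic and the next largest one, the maximum $\gamma^t_{d,i}$ 
is indeed the largest topic in the document. Furthermore, since $(1-o(1))\gamma^*_{d,i} \le \gamma^t_{d,i} \le (1+o(1))\gamma^*_{d,i}$, clearly 
$\frac{1}{2}\gamma^*_{d,i} \le \gamma^t_{d,i} \le 2\gamma^*_{d,i}$ follows as well. 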

On the other hand, we claim no topic which is in the support of a document $d$ can actually have $\gamma^t_{d,i} = 0$. 
If this happens, it’s easy to see that $\sum_j \tilde f_{d,j}\log(\tilde f_{d,j}/f^t_{d,j}) = \infty$: one only needs to look at a summand 
corresponding to a lone word $j$ for topic $i$. Just by virtue of the way lone words are defined, $\gamma^t_{d,i} = 0$ would 
imply $f^t_{d,j} = 0$. It’s clear that one can get a finite value for the KL divergence, on the other hand, by just 
setting $\gamma_{d,i} = \gamma^*_{d,i}$, so $\gamma^t_{d,i} = 0$ cannot happen at an optimum. 

□ 
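
As a rough illustration of the $\gamma$ step analyzed above, the following Python sketch iterates a multiplicative update whose fixed points satisfy B.1 on the support. It is a minimal sketch under stated assumptions, not the implementation used in this work: the names `gamma_step` and `kkt_residuals`, the plain-list data layout, and the fixed iteration count are hypothetical, and the rule $\gamma_{d,i} \leftarrow \gamma_{d,i}\sum_j \frac{\tilde f_{d,j}}{f_{d,j}}\beta_{i,j}$ is the standard EM-style multiplicative update for minimizing $KL(\tilde f_d \| f_d)$ over the simplex, swapped in here in place of a generic convex solver.

```python
def gamma_step(f_tilde, beta, gamma, iters=500):
    """Multiplicative gamma updates for one document (hypothetical helper).

    f_tilde: list of N empirical word frequencies (sums to 1).
    beta:    K x N topic-word matrix as a list of lists (rows sum to 1).
    gamma:   initial topic proportions; entries set to 0 stay 0, which
             is one way to enforce a known support.
    """
    K, N = len(beta), len(f_tilde)
    gamma = list(gamma)
    for _ in range(iters):
        # Current model frequencies f_j = sum_i gamma_i * beta_ij.
        f = [sum(gamma[i] * beta[i][j] for i in range(K)) for j in range(N)]
        gamma = [
            gamma[i] * sum(f_tilde[j] / f[j] * beta[i][j]
                           for j in range(N) if f[j] > 0)
            for i in range(K)
        ]
    return gamma


def kkt_residuals(f_tilde, beta, gamma):
    """Left-hand side of B.1 for every topic (close to 1 at an optimum)."""
    K, N = len(beta), len(f_tilde)
    f = [sum(gamma[i] * beta[i][j] for i in range(K)) for j in range(N)]
    return [sum(f_tilde[j] / f[j] * beta[i][j] for j in range(N) if f[j] > 0)
            for i in range(K)]


# Toy instance: word 0 is a lone word for topic 0, word 3 for topic 1.
beta = [[0.4, 0.4, 0.2, 0.0],
        [0.0, 0.2, 0.4, 0.4]]
true_gamma = [0.6, 0.4]
f_tilde = [sum(true_gamma[i] * beta[i][j] for i in range(2)) for j in range(4)]
estimate = gamma_step(f_tilde, beta, [0.5, 0.5])
```

With noiseless frequencies and the true $\beta$, the iterates approach the true proportions, and `kkt_residuals` can be used to check how close a given $\gamma$ is to satisfying B.1.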


B.1.2 Lower bounds on the $\gamma^t_{d,i}$ and $\beta^t_{i,j}$ variables 

Next, we show that subject to the thresholding being correct, at any point in time $t$, all the estimates $\gamma^t_{d,i}$ 
and $\beta^t_{i,j}$ are appropriately lower bounded. 

The proof is similar for both the $\beta$ and $\gamma$ variables, and for both the KL-tEM and iterative tEM updates, 
but as mentioned before, we focus on KL-tEM first. 

Lemma 7. Fix a particular document $d$. Suppose that the supports of the $\gamma$ and $\beta$ variables are correct. 
Then, $\gamma^t_{d,i} \ge (1 - o(1))\gamma^*_{d,i}$. 

One gets the KKT conditions B.1 trivially, by turning the constraint that $\sum_i \gamma_{d,i} = 1$ into a Lagrange multiplier. 
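Proof. Multiplying both sides of B.1 by $\gamma^t_{d,i}$, we get 

$$\gamma^t_{d,i} = \sum_{j=1}^{N} \frac{\tilde f_{d,j}}{f^t_{d,j}}\gamma^t_{d,i}\beta^t_{i,j}$$

As above, let’s split the above sum in two parts: lone words, and non-lone. For a lone word $j \in L_i$ we have 
$f^t_{d,j} = \gamma^t_{d,i}\beta^t_{i,j}$, so the corresponding summand equals $\tilde f_{d,j}$. Then clearly, 

$$\gamma^t_{d,i} \ge \sum_{j \in L_i} \tilde f_{d,j} \ge (1-\epsilon)\gamma^*_{d,i}\sum_{j \in L_i}\beta^*_{i,j}$$

For notational convenience, let’s denote $\alpha = \sum_{j \in L_i}\beta^*_{i,j}$. Let’s estimate $\alpha$. By the assumption on the size 
of the intersection of topics, 

$$\sum_{j \notin L_i}\beta^*_{i,j} \le T r = o(1)$$

Hence, $\frac{\gamma^t_{d,i}}{\gamma^*_{d,i}} \ge (1-\epsilon)(1-o(1)) = 1 - o(1)$. So, the claim of the lemma holds. 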


□ 


The lower bound on the $\beta^t_{i,j}$ values proceeds similarly, but here we will crucially make use of the fact 
that for the large topics, we have both upper and lower bounds on the $\gamma^t_{d,i}$ values. 

Lemma 8. Suppose that the supports of the $\gamma$ and $\beta$ variables are correct. Additionally, if $i$ is a large topic 
in $d$, let $\frac{1}{2}\gamma^*_{d,i} \le \gamma^t_{d,i} \le 2\gamma^*_{d,i}$. Then, $\beta^{t+1}_{i,j} \ge \frac{1}{2}(1 - o(1))\beta^*_{i,j}$. 

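Proof. For the topic-word pair $(i,j)$, let’s call lone the documents where $\beta^*_{i',j} = 0$ for all other topics $i' \neq i$ 
appearing in that document. Let $D_l$ be the set of lone documents. Then, certainly it’s true that 

$$\beta^{t+1}_{i,j} \ge \beta^t_{i,j}\frac{\sum_{d \in D_l}\frac{\tilde f_{d,j}}{f^t_{d,j}}\gamma^t_{d,i}}{\sum_{d=1}^{D}\gamma^t_{d,i}}$$

However, for a lone document, $f^t_{d,j} = \gamma^t_{d,i}\beta^t_{i,j}$ (it’s easy to check all the other terms in the summation for 
$f^t_{d,j}$ vanish, because either $\gamma^t_{d,i'} = 0$ or $\beta^t_{i',j} = 0$). Hence, 

$$\beta^{t+1}_{i,j} \ge \frac{\sum_{d \in D_l}\tilde f_{d,j}}{\sum_{d=1}^{D}\gamma^t_{d,i}} \ge (1-\epsilon)\beta^*_{i,j}\frac{\sum_{d \in D_l}\gamma^*_{d,i}}{\sum_{d=1}^{D}\gamma^t_{d,i}}$$

However, since the update is happening only over documents where topic $i$ is large, $\gamma^t_{d,i} \le 2\gamma^*_{d,i}$. So, we 
can conclude 

$$\beta^{t+1}_{i,j} \ge \frac{1}{2}(1-\epsilon)\beta^*_{i,j}\frac{\sum_{d \in D_l}\gamma^*_{d,i}}{\sum_{d=1}^{D}\gamma^*_{d,i}}$$

Let’s call $\alpha = \frac{\sum_{d \in D_l}\gamma^*_{d,i}}{\sum_{d=1}^{D}\gamma^*_{d,i}}$ and let’s analyze its value. 

By Lemma 51 and Lemma 50, 

$$\sum_{d \in D_l}\gamma^*_{d,i} \ge (1-\epsilon)|D_l|\,E[\gamma^*_{d,i} \mid \gamma^*_{d,i} \text{ is dominating}, \gamma^*_{d,i'} = 0, \forall i' \neq i \text{ s.t. } j \text{ appears in topic } i']$$

$$\sum_{d=1}^{D}\gamma^*_{d,i} \le (1+\epsilon)|D|\,E[\gamma^*_{d,i} \mid \gamma^*_{d,i} \text{ is dominating}]$$

By the weak topic correlations assumption, then, $\alpha \ge (1 - o(1))\frac{1-\epsilon}{1+\epsilon}\frac{|D_l|}{|D|}$. 

Furthermore, by the independent topic inclusion property, each of the $o(K)$ topics other than $i$ that word 
$j$ belongs to appears in a document with probability $O(1/K)$, so the probability that a document which 
contains topic $i$ contains one of them is $o(1)$. By Lemma 52, furthermore, $\frac{1-\epsilon}{1+\epsilon}\frac{|D_l|}{|D|} \ge 1 - o(1)$ when 
$\epsilon = o(1)$. Hence, $\alpha \ge 1 - o(1)$. 

Altogether, we get that $\beta^{t+1}_{i,j} \ge \frac{1}{2}(1 - o(1))\beta^*_{i,j}$, as claimed. 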

□ 
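
To make the thresholded $\beta$ update analyzed in Lemma 8 concrete, here is a minimal Python sketch. It assumes the update has the multiplicative form $\beta^{t+1}_{i,j} = \beta^t_{i,j}\,\frac{\sum_d (\tilde f_{d,j}/f^t_{d,j})\gamma^t_{d,i}}{\sum_d \gamma^t_{d,i}}$, with both sums restricted to the documents in which topic $i$ currently has the largest $\gamma$ value. The function name `beta_step` and the list-of-lists layout are hypothetical choices made for illustration only.

```python
def beta_step(f_tilde, beta, gamma):
    """One thresholded multiplicative beta update (hypothetical helper).

    f_tilde: D x N empirical word frequencies, one row per document.
    beta:    K x N current topic-word estimates.
    gamma:   D x K current topic proportions.
    Only documents in which topic i has the largest gamma value
    contribute to the update of row i.
    """
    D, K, N = len(gamma), len(beta), len(beta[0])
    new_beta = [row[:] for row in beta]
    for i in range(K):
        docs = [d for d in range(D) if gamma[d][i] == max(gamma[d])]
        denom = sum(gamma[d][i] for d in docs)
        if denom == 0:
            continue  # no document has topic i as its largest topic
        for j in range(N):
            num = 0.0
            for d in docs:
                f = sum(gamma[d][k] * beta[k][j] for k in range(K))
                if f > 0:
                    num += f_tilde[d][j] / f * gamma[d][i]
            new_beta[i][j] = beta[i][j] * num / denom
    return new_beta


# Sanity check: the ground truth is a fixed point of the update.
beta_star = [[0.5, 0.5, 0.0],
             [0.0, 0.5, 0.5]]
gamma_star = [[0.8, 0.2],
              [0.3, 0.7]]
f_star = [[sum(g[k] * beta_star[k][j] for k in range(2)) for j in range(3)]
          for g in gamma_star]
unchanged = beta_step(f_star, beta_star, gamma_star)
```

At the ground truth every ratio $\tilde f_{d,j}/f_{d,j}$ equals 1, so the update leaves $\beta^*$ unchanged; zero entries of $\beta$ also stay zero, which is how the support is preserved.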


B.1.3 Upper bound on the $\beta^t_{i,j}$ values 

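Having established a lower bound on the $\beta^t_{i,j}$ variables throughout all iterations, together with the lower 
bounds on the $\gamma^t_{d,i}$ variables and the good estimates for the large topics, we will be able to prove that the upper 
bound on the multiplicative error of $\beta^t_{i,j}$ keeps improving, until $\beta^t_{i,j} \le C_\beta\beta^*_{i,j}$, for some constant $C_\beta$. 

Lemma 9. Let the $\beta$ variables have the correct support, and $\beta^t_{i,j} \ge \frac{1}{C_m}\beta^*_{i,j}$, $\gamma^t_{d,i} \ge \frac{1}{C_m}\gamma^*_{d,i}$ whenever $\beta^*_{i,j} \neq 0$, 
$\gamma^*_{d,i} \neq 0$. Let $\beta^t_{i,j} \le C^t_\beta\beta^*_{i,j}$, where $C^t_\beta \ge 4C_m$, and $C_m$ is a constant. Then, in the next iteration, $\beta^{t+1}_{i,j} \le 
C^{t+1}_\beta\beta^*_{i,j}$, where $C^{t+1}_\beta \le \frac{C^t_\beta}{2}$. 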

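Proof. Without loss of generality, let’s assume $C_m \ge 2$. (Since certainly, if the statement of the lemma holds 
with a smaller constant, it holds with $C_m = 2$.) 

We proceed similarly as in the prior analyses. We will split the sum into the portions corresponding to 
the lone and non-lone documents. 

Let’s analyze the terms $\frac{\tilde f_{d,j}}{f^t_{d,j}}\gamma^t_{d,i}$ corresponding to the non-lone documents. 

Now, $f^t_{d,j} \ge \frac{1}{C_m^2}f^*_{d,j}$, so $\frac{\tilde f_{d,j}}{f^t_{d,j}} \le (1+\epsilon)C_m^2$. Also, $\gamma^t_{d,i} \le 2\gamma^*_{d,i}$, since topic $i$ is the dominant one in document $d$. 
Since $C_m \ge 2$, $\frac{\tilde f_{d,j}}{f^t_{d,j}}\gamma^t_{d,i} \le (1+\epsilon)C_m^3\gamma^*_{d,i}$. 

Also, note that $\sum_{d=1}^{D}\gamma^t_{d,i} \ge \frac{1}{2}\sum_{d=1}^{D}\gamma^*_{d,i}$, again, since $i$ is the dominant topic. 

As usual, let’s denote the set of lone documents $D_l$. Then, using $2 \le C_m$ once more, 

$$\beta^{t+1}_{i,j} \le (1+\epsilon)C_m\frac{\sum_{d \in D_l}\gamma^*_{d,i}\beta^*_{i,j} + \sum_{d \in D \setminus D_l}C_m^3\gamma^*_{d,i}\beta^t_{i,j}}{\sum_{d=1}^{D}\gamma^*_{d,i}}$$

As in the prior proofs, let’s denote $\alpha := \frac{\sum_{d \in D_l}\gamma^*_{d,i}}{\sum_{d=1}^{D}\gamma^*_{d,i}}$. 

As in Lemma 8, $\alpha \ge 1 - o(1)$, so $\beta^{t+1}_{i,j} \le (1+\epsilon)C_m(\alpha\beta^*_{i,j} + (1-\alpha)C_m^3\beta^t_{i,j})$, which in turn implies that 
$\frac{\beta^{t+1}_{i,j}}{\beta^*_{i,j}} \le (1+\epsilon)C_m(\alpha + (1-\alpha)C_m^3C^t_\beta)$. In order to ensure that $\frac{\beta^{t+1}_{i,j}}{\beta^*_{i,j}} \le \frac{C^t_\beta}{2}$, it would be sufficient to prove 

$$(1+\epsilon)C_m(\alpha + (1-\alpha)C_m^3C^t_\beta) \le \frac{C^t_\beta}{2}$$

which is equivalent to 

$$\alpha \ge \frac{C_m^3C^t_\beta - \frac{C^t_\beta}{2(1+\epsilon)C_m}}{C_m^3C^t_\beta - 1}$$

Let’s look at the right hand side. As, by assumption, $C^t_\beta \ge 4C_m$, it follows that 

$$\frac{C_m^3C^t_\beta - \frac{C^t_\beta}{2(1+\epsilon)C_m}}{C_m^3C^t_\beta - 1} \le \frac{C_m^3C^t_\beta - \frac{C^t_\beta}{2(1+\epsilon)C_m}}{C_m^3C^t_\beta - \frac{C^t_\beta}{4C_m}}$$

Hence, the right hand side is upper bounded by 

$$\frac{C_m^3 - \frac{1}{2(1+\epsilon)C_m}}{C_m^3 - \frac{1}{4C_m}}$$

But, since $C_m$ is bounded by a constant, and $\alpha = 1 - o(1)$, the claim follows. 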


□ 

















B.1.4 Upper bounds on the $\gamma$ values 


Finally, we show that if we ever reach a point where the $\beta$ values are both upper and lower bounded by 
a constant, the $\gamma$ values one gets after the $\gamma$ step are appropriately upper bounded by a constant. More 
precisely: 

Lemma 10. Fix a particular document $d$. Let’s assume the supports for the $\beta$ and $\gamma$ variables are correct. 
Furthermore, let $\frac{1}{C_m} \le \frac{\beta^*_{i,j}}{\beta^t_{i,j}} \le C_m$ for some constant $C_m$. Then, $\gamma^t_{d,i} \le (1 + o(1))\gamma^*_{d,i}$. 

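Proof. As in the proof of Lemma 7, let’s split the KKT conditions for $\gamma^t_{d,i}$ into a part corresponding to 
lone words $L_i$ and a part corresponding to non-lone words. Multiplying B.1 by $\gamma^t_{d,i}$ as before, 

$$\gamma^t_{d,i} = \sum_{j \in L_i}\tilde f_{d,j} + \gamma^t_{d,i}\sum_{j \notin L_i}\frac{\tilde f_{d,j}}{f^t_{d,j}}\beta^t_{i,j}$$

Again, let $\alpha = \sum_{j \in L_i}\beta^*_{i,j}$. 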

fd 

By Lemma 7 , certainly 7^^^ > ^ 7 d,*- Hence, ^ < (1 + e)C^. So we have, j^i < (1 + e)ia-fli + 

■bd,j 

C^(l — a)^\ i). In other words, this implies 7^ , < 2-(i+t^a°'(i-a) 7d v Since d = 1 — o(l), it’s easy to check 

which is enough for what we need. 

□ 


So, as a corollary, we finally get: 

Corollary 11. For some $t_0 = O(\log N)$, it will be the case that for all $t \ge t_0$, $\frac{1}{C^0_\beta} \le \frac{\beta^*_{i,j}}{\beta^t_{i,j}} \le C^0_\beta$ 
for some constant $C^0_\beta$ and $\frac{1}{C^0_\gamma} \le \frac{\gamma^*_{d,i}}{\gamma^t_{d,i}} \le C^0_\gamma$ for some constant $C^0_\gamma$. 

This concludes Phase I of the analysis. 


B.1.5 Phase II: Alternating minimization - upper and lower bound evolution 

Taking Corollary 11 into consideration, we finally show that, if the $\beta$ and $\gamma$ values are correct up to a 
constant multiplicative factor, and we have the correct support, we can improve the multiplicative error in 
each iteration, thus achieving convergence to the correct values. 

This portion bears resemblance to techniques like state evolution and density evolution in the literature 
on iterative methods for decoding error correcting codes. In those techniques, one keeps track of a certain 
quantity of the system that’s evolving in each iteration. In density evolution, this is the probability density 
function of the messages that are being passed; in state evolution, it is a certain average and variance of the 
variables we are estimating. 

In our case, we keep track of the “multiplicative accuracy” of our estimates $\gamma^t_{d,i}$, $\beta^t_{i,j}$. In particular, we 
will keep track of quantities $C^t_\gamma$ and $C^t_\beta$, such that at iteration $t$, $\frac{1}{C^t_\beta} \le \frac{\beta^*_{i,j}}{\beta^t_{i,j}} \le C^t_\beta$ and $\frac{1}{C^t_\gamma} \le \frac{\gamma^*_{d,i}}{\gamma^t_{d,i}} \le C^t_\gamma$ after 
the corresponding $\gamma$ iteration. 

We will show that improvement in the quantities $C^t_\beta$ causes a large enough improvement in the $C^t_\gamma$ 
updates, so that after an alternating step of $\beta$ and $\gamma$ updates, $C^{t+1}_\beta \le (C^t_\beta)^{1/2}$. 

First, we show that when the $\beta$ variables are estimated up to a constant multiplicative factor, the constant 
for the $\gamma$ values after they’ve been iterated to convergence is slightly better than the constant for the $\beta$ values. 
More precisely: 

Lemma 12. Let’s assume that our current iterates fdjj satisfy ^ < ^r— < for Then, 

after iterating the 7 updates to convergence, we will get values 7^ ^ that satisfy 
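Proof. As usual, we will split the KKT conditions for $\gamma^t_{d,i}$ into two parts: one for the lone, and one for 
the non-lone words. Let’s call the set of lone words $L_i$, as previously. Then, we have 

$$\gamma^t_{d,i} = \sum_{j \in L_i}\tilde f_{d,j} + \gamma^t_{d,i}\sum_{j \notin L_i}\frac{\tilde f_{d,j}}{f^t_{d,j}}\beta^t_{i,j}$$

Again, let $\alpha = \sum_{j \in L_i}\beta^*_{i,j} \ge 1 - o(1)$, as proved before. 

Let’s denote $C^* = \max_i(\max(\frac{\gamma^t_{d,i}}{\gamma^*_{d,i}}, \frac{\gamma^*_{d,i}}{\gamma^t_{d,i}}))$. 

We claim that it has to hold that $C^* \le (C^t_\beta)^{1/3}$. Assume the contrary, and let $i_0 = \arg\max_i(\max(\frac{\gamma^t_{d,i}}{\gamma^*_{d,i}}, \frac{\gamma^*_{d,i}}{\gamma^t_{d,i}}))$. 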

Let’s first assume that = Cl. 


~<d,i 


By the definition of C* 


ld,io ~ fd,J+lXio ^ ^ ^)i^'yd,to ~ 


We claim that 


(1 + e)(d + (1 - d)(C‘)^(C;)^) < 


(B.2) 


which will be a contradiction to the definition of C^. 

(pt)l /3 


After a little rewriting, B.2 translates to d > 1 — (gtct) 2 _\ ■ By our assumption on C*, < C^, so the 




right hand side above is upper bounded by 1 -■ 


But, Lemma 10 implies that certainly < C°, where is some absolute constant. The function 


/(c) = 


l+e 


- 1 


c8 - 1 


can be easily seen to be monotonically decreasing on the interval of interest, and hence is lower bounded by 

(go) 8 _i ? which is in terms some absolute constant smaller than one. Since a = 1 — o(l). the claim we want 
is clearly true. 


The case where 


yd,. 


= C* is similar. In this case. 


7l.o = H kj + iko Y. ^ (1 - c)(«7d.,o + (1 - a) 


jeLi, 


_ ft 


{CI)\C\Y 


ih.) 


We then claim that 


Again, B.3 rewrites to: 


(1 e)(a + (l ^ (q)i/3 


{ci)kc*y^ ^ ^ (i-e)(C‘)i/^ 


1 - 




1 - 


(CtC‘)2 


(B.3) 


1 -tTTs 

Again, the right hand side above is upper bounded by 1-;——-. But C.y G [liCi)!, and the 


1 - 

function — 7 — 4 ^— is monotonically increasing, so lower bounded by 

J- “JT 


^ 1 - (1 - e)4/3 ^ 1 

I - , iV. “ l-(l-e)^® “ 42 
Hence, 1 — 


■I’' — is upper bounded by li. Again, our bound on a gives us what we want. 


□ 


Lemma 13. Let’s assume that our current iterates j3l^ satisfy < C^, C*p > (pz^yf; after the 

corresponding 7 update, we get i < C* where Ci > (C*)^. Then, after one j5 step, we will get new 

^7 '^d,i ^ ^ ^ 

values that satisfy ^ where 


Proof. The proof proceeds in complete analogy with Lemmas 8 and 9. 

Again, let’s tackle the lower and upper bound separately. The upper bound condition is: 


a > 


(r^t ru \2 (C^) ^ 

y-'p'-'l) (l+e)C‘ 

{Cl^ClY - 1 


{CD 




- 1 


Using Cp > (C*)^, we can upper bound the expression on the right by 1 — ,3 -. 


The function 


«l/6 


/(c) = xb/s-i monotonically decreasing on the interval [ 1 , C^] of interest, so because a = 1 — o(l), we get 
what we want. 

Similarly, for the lower bound, we want that 


a > 


g* _ 1 

(C|) 1 / 2 ( 1 _,) (C‘C‘)= 


1 - 




Yet again, using Cp > we get that the right hand side is upper bounded by 


1 - 


1 - 




^ 'h 


1 - 

However, the function /(c) = —— is monotonically increasing on the interval [1, CS], so lower bounded 

i B /Q h' 


1 - 


by 


I — 

■ ( 1 


l-(l-e)g'^ ^ 1 XJ 1 

—> zb- Hence, 1 — 


(l-e)C 


176 




126 ■ 


1 -- 


— is upper bounded by so using the fact 


that a = 1 — o(l), we get what we want. 


□ 


Putting lemmas 12 and 13 together, we get: 

Lemma 14. Suppose it holds that < C*, C* > yj-zpyr- Then, after one KL minimization step 

with respect to the 7 variables and one P iteration, we get new values that satisfy ohfT D D , 
where = VC^ 

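Proof. By Lemma 12, after the $\gamma$ iterations, we get $\gamma^t_{d,i}$ values that satisfy the condition $\frac{1}{C^t_\gamma} \le \frac{\gamma^*_{d,i}}{\gamma^t_{d,i}} \le C^t_\gamma$, 
where $C^t_\gamma = (C^t_\beta)^{1/3}$. 

Then, by Lemma 13, after the $\beta$ iteration, we will get $\frac{1}{C^{t+1}_\beta} \le \frac{\beta^*_{i,j}}{\beta^{t+1}_{i,j}} \le C^{t+1}_\beta$, such that $C^{t+1}_\beta = \sqrt{C^t_\beta}$, 
which is what we need. 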

□ 


Hence, as a corollary, we get immediately: 
Corollary 15. Lemma 14 above implies that Phase III requires $O(\log(1/\epsilon'))$ iterations to 
estimate each of the topic-word matrix and document proportion entries to within a multiplicative factor of 
$1 + \epsilon'$. 
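
The iteration count in Corollary 15 can be illustrated with a short Python sketch. It only models the recursion $C^{t+1}_\beta = \sqrt{C^t_\beta}$ from Lemma 14 and ignores the floor on the achievable accuracy imposed by the sampling noise $\epsilon$; the function name `rounds_to_accuracy` and the starting constant 16 are hypothetical values chosen for illustration.

```python
import math


def rounds_to_accuracy(c0, eps_prime):
    """Count alternating rounds until the multiplicative error bound
    drops to 1 + eps_prime, assuming the bound is square-rooted in
    every round (hypothetical helper)."""
    c, rounds = c0, 0
    while c > 1 + eps_prime:
        c = math.sqrt(c)
        rounds += 1
    return rounds


rounds = rounds_to_accuracy(16.0, 0.01)
```

Since $\log C^t_\beta$ halves in every round, the number of rounds grows like $\log_2(\log C^0_\beta / \log(1+\epsilon'))$, which is $O(\log(1/\epsilon'))$ for a constant starting value.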

This finishes the proof of Theorem 1 for the KL-tEM version of the updates. In the next section, we will 
remark on why the proofs are almost identical for the iterative and incomplete tEM versions of the updates. 

B.2 Iterative tEM updates, incomplete tEM updates 

We show how to modify the proofs to show that the iterative tEM and incomplete tEM updates work as 
well. We’ll just sketch the arguments, as they are almost identical to the ones above. 

In those updates, when we are performing a $\gamma$ update, we initialize with $\gamma_{d,i} = 0$ whenever topic $i$ does 
not belong to document $d$, and $\gamma_{d,i}$ uniform among all the other topics. 

Then, the way to modify Lemmas 7, 10, 12 is simple. Instead of arguing by contradiction about what 
happens at the KKT conditions, one will assume that at iteration $t'$ ($t'$ to indicate these are the separate 
iterations for the $\gamma$ variables that converge to the values $\gamma^t_{d,i}$) it holds that $\frac{1}{C^{t'}_\gamma}\gamma^*_{d,i} \le \gamma^{t'}_{d,i} \le C^{t'}_\gamma\gamma^*_{d,i}$. Then, as 
long as $C^{t'}_\gamma$ is too big compared to $C^t_\beta$, one can show that $C^{t'}_\gamma$ is decreasing (to $C^{t'+1}_\gamma$, say), using 
exactly the same argument we had before. Furthermore, the number of such iterations needed will clearly 
be logarithmic. 

But the same argument as above proves the incomplete tEM updates work as well. Namely, even if we 
perform only one update of the 7 variables, they are guaranteed to improve. 

B.3 Initialization 

For completeness, we also give here a fairly easy, efficient initialization algorithm. Recall, the goal of this 
phase is to recover the supports - i.e. to find out which topics are present in a document, and identify the 
support of each topic. To reiterate the theorem statement: 

Theorem 2. If the number of documents is log^ itT), there is a polynomial-time procedure which with 
probability 1 — correctly identifies the supports of the fi*j and 7 ^^ variables. 

We will find the topic supports first. Roughly speaking, we will devise a test, which will take as input 
two documents $d, d'$, and will try to determine if the two documents have a topic in common or not. The 
test will have no false positives, i.e. it will never say YES if the documents do not have a topic in common, but 
might say NO even if they do. We will then ensure that, with high probability, for each topic we find a pair 
of documents intersecting in that topic, such that the test says YES. 

We will also be able to identify which pairs intersect in exactly one topic, and from this we will be able to 
find all the topic supports. Having done all of this, finding the topics in each document will be easy as well. 
Roughly speaking, if a document doesn’t contain a given topic, it will not contain all of the discriminative 
words in that topic. 

We give the algorithm formally as pseudocode Algorithm 4. 

Now, let’s proceed to analyze the above algorithm, proceeding in a few parts. 

B.3.1 Constructing a no-false-positives test 

First, we describe how one determines the supports of the topics. Let’s define Test{d,d') = YES, if 
J2j 7 , fd' j) ^ and NO otherwise. Then, we claim the following. 

Lemma 16. If $d, d'$ both contain a topic $i_0$, s.t. $\gamma^*_{d,i_0} \ge 1/T$, $\gamma^*_{d',i_0} \ge 1/T$, then Test($d,d'$) = YES. If $d, d'$ 
do not contain a topic $i_0$ in common, then Test($d,d'$) = NO. 

Proof. Let’s prove the first claim. 

fd',j} > - e) > 

3 3 

3 
Algorithm 4 Initialization 

repeat A"* log^ K times 

Sample a pair of documents (d,d'). 

>Test if {d,d') intersect with no false positives: 
if > W then 

Sd,d' ■■= {j,s.t.fl^,f^, j > 0} 

>” Weed-out” words that are not in the support of the intersection of (d,d’) 
for all documents d" {d, d'} do 

if > W and Ej^Hfd',jJd",j} > W then 

Sd,d' = Sd,d'f^j,s-i rd„^,>Q 

end if 
end for 
end if 

until 

oDetermine which Sa,b correspond to documents intersecting in one topic only) 

if Set Sa,b appears less than D/K'^-^ times, where D is the total number of documents then 
Remove Sa,b- 

end if 

if Set Sa^b can be written as the union of two other sets Sc^d, Sej, where neither is contained inside the 

other then 

Remove Sa^b- 

end if 

if Set Sa,b is strictly contained inside Sd,d' for some Sd,d' then 
Remove Sd,d'- 

end if 

Remove duplicates. 

The remaining lists Sa,b are declared to be topic supports. 
Now, let’s prove the second claim. Let’s suppose $d, d'$ contain no topic in common. 

Let’s fix a topic $i_0$ that belongs to document $d$. By the “small discriminative words intersection” property, we 
have the following: 

$$\sum_{j \in i_0, j \in i'}\beta^*_{i_0,j} = o(1)$$

for any other topic $i' \neq i_0$. 

Denote by $T_{outside}$ the words belonging to topic $i_0$ and to no topic in document $d'$, and by $T_{inside}$ the words 
of topic $i_0$ belonging to at least one other topic in $d'$. 

For the words $j \in T_{outside}$, $\min\{\tilde f_{d,j}, \tilde f_{d',j}\} = 0$. 

By the above, 

^ min{/dj, < (1 + e)T'^o{l) = o(l) 
j 

Thus, the test will say NO, as we wanted. 

□ 


B.3.2 Finding the topic supports from identifying pairs 

Let’s call $d, d'$ an identifying pair of documents for topic $i$ if $d, d'$ intersect in topic $i$ only, and furthermore 
the test says YES on that pair. 

From this identifying pair, we show how to find the support of the topic $i$ in the intersection. What we’d 
like to do is just declare the words $j$ s.t. $\tilde f_{d,j}$, $\tilde f_{d',j}$ are both non-zero as the support of topic $i$. Unfortunately, 
this doesn’t quite work. The reason is that one might find words $j$ s.t. they belong to one topic $i'$ in $d$, 
and another topic $i''$ in $d'$. Fortunately, this is easy to remedy. As per the pseudo-code above, let’s call the 
following operation WEEDOUT($d, d'$): 

• Set $S = \{j \text{ s.t. } \tilde f_{d,j} > 0, \tilde f_{d',j} > 0\}$. 

• For all $d''$ s.t. Test($d,d''$) = YES, Test($d',d''$) = YES: 

• Set $S = S \cap \{j \text{ s.t. } \tilde f_{d'',j} > 0\}$. 

• Return $S$. 
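
The test and the WEEDOUT operation above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: documents are represented as dictionaries from words to empirical frequencies, the test statistic is taken to be $\sum_j \min\{\tilde f_{d,j}, \tilde f_{d',j}\}$ compared against a threshold, and the threshold is left as a free parameter (the value 0.4 below is arbitrary). The names `overlap`, `shares_topic` and `weedout` are hypothetical.

```python
def overlap(f_a, f_b):
    """Sum over words of the smaller of the two empirical frequencies."""
    return sum(min(v, f_b.get(j, 0.0)) for j, v in f_a.items())


def shares_topic(f_a, f_b, threshold):
    """The pairwise test: YES (True) when the overlap is large enough."""
    return overlap(f_a, f_b) >= threshold


def weedout(d, d_prime, docs, threshold):
    """Intersect the common support of (d, d_prime) with the support of
    every third document that passes the test against both of them."""
    support = {j for j, v in docs[d].items()
               if v > 0 and docs[d_prime].get(j, 0.0) > 0}
    for name, f in docs.items():
        if name in (d, d_prime):
            continue
        if (shares_topic(docs[d], f, threshold)
                and shares_topic(docs[d_prime], f, threshold)):
            support &= {j for j, v in f.items() if v > 0}
    return support


# Toy corpus: topic A = {p, q}, topic B = {r, s}, topic C = {s, u}.
docs = {
    "d": {"p": 0.25, "q": 0.25, "r": 0.25, "s": 0.25},        # topics A, B
    "d_prime": {"p": 0.25, "q": 0.25, "s": 0.25, "u": 0.25},  # topics A, C
    "d2": {"p": 0.5, "q": 0.5},                               # topic A only
    "d3": {"r": 0.5, "s": 0.5},                               # topic B only
}
support_a = weedout("d", "d_prime", docs, threshold=0.4)
```

In this toy corpus the word `s` appears in both `d` and `d_prime` through two different topics, so it is in the initial set; it is removed once the third document `d2`, which passes the test against both, is found not to contain it.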

Lemma 17. With probability for any pair of documents d,d' intersecting in one topic, WEEDOUT{d,d') 

is the support of S. 

Proof. For this, we prove two things. First, it’s clear that $S$ is initialized in the first line in a way that 
ensures that it contains all words in the support of topic $i$. Furthermore, it’s clear that at no point in 
time will we remove a word $j$ from $S$ that is in the support of topic $i$. Indeed, if Test($d,d''$) = YES and 
Test($d',d''$) = YES, then by Lemma 16 document $d''$ must contain topic $i$. In this case, $\tilde f_{d'',j} > 0$, and we 
won’t exclude $j$ from $S$. 

So, we only need to show that the words that are not in the support of topic $i$ will get removed. 

Let $d, d'$ intersect in a topic $i$. Let a word $j$ be outside the support of topic $i$. Because of the 
independent topic inclusion property, the probability that a document $d''$ contains topic $i$, and no other topic 
containing $j$, is $\Omega(1/K)$. 

Since the number of documents is Sl{K^ log^ K), by Chernoff, the probability that there is a document 
d”, s.t. Test[d,d") = YES, Test{d' ,d") = YES, but f^„j = 0, is 1 — k )- Union bounding over 

all words j, as well as pairs of documents d, d', we get that for any documents d, d' intersection in a topic i, 
we get the claim we want. 

□ 
B.3.3 Finding the identifying pairs 

Finally, we show how to actually find the identifying pairs. The main issue we need to handle is pairs of documents 
that do intersect, and on which the test returns YES, but which intersect in more than one topic. The above 
algorithm handles this as follows. 

• First, we delete all sets in the list of sets Sa,b that show up less than D '^/number of times. 

• Second, we remove sets that can be written as the union of two other sets Sc,d, Sej, where neither of 
the two is contained inside the other. 

• After this, we delete the non-maximal sets in the list. 
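
The three filtering passes above can be sketched in Python as follows. This is a hedged illustration, not the exact procedure of Algorithm 4: the frequency cutoff is passed in as a plain integer `min_count` because its exact value depends on the number of documents and topics, and the function name `prune_candidate_sets` is hypothetical.

```python
from collections import Counter


def prune_candidate_sets(candidates, min_count):
    """Filter candidate word sets down to likely topic supports.

    candidates: iterable of word sets, one per tested document pair
                (repeats are expected and are counted).
    min_count:  minimum number of appearances a set needs to survive.
    """
    counts = Counter(frozenset(s) for s in candidates)

    # Pass 1: drop sets that show up too rarely.
    frequent = {s for s, c in counts.items() if c >= min_count}

    # Pass 2: drop sets that are the union of two other sets, neither
    # of which is contained inside the other.
    def is_union(s):
        others = [t for t in frequent if t != s]
        return any(a | b == s and not a <= b and not b <= a
                   for a in others for b in others)

    no_unions = {s for s in frequent if not is_union(s)}

    # Pass 3: drop non-maximal sets (strictly contained in another set).
    return {s for s in no_unions if not any(s < t for t in no_unions)}


# Two topic supports, their union, their intersection, and a rare set.
candidates = ([{1, 2, 3}] * 3 + [{3, 4, 5}] * 3
              + [{1, 2, 3, 4, 5}] * 2 + [{3}] * 2 + [{1, 9}])
supports = prune_candidate_sets(candidates, min_count=2)
```

The order of the passes matters: if non-maximal sets were removed before unions, the true topic supports (which are contained in their union) would be discarded.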

The following lemma holds: 

Lemma 18. Each topic has VL{D‘^/identifying pairs with probability 1 — 

Proof. Let li be the event that there are at least /k'^) identifying pairs for topic i. Let Ni be a random 
variable denoting the number of documents which have topic z as a dominating topic. Furthermore, let 
M-i be the event that there are at least -if — K^Nf identifying pairs among the Ni ones that have i as 
a dominating topic. By the dominant topic equidistribution property, probability that a document d has a 
topic z as a dominating topic is at least C/K for some constant C. Then, clearly, 

> Pr [Nr > Pr i^Ni > 

Let’s estimate Pr (A^i > first. The probabilities that different documents have zq as the 

dominating topic are clearly independent, so by Chernoff, if Ni is the number of documents where z is the 
dominating topic, 

Pr[A^* > (1 - e)C--] > 1 - 

K 

Since D = plugging in e = 5 , PT[Ni < > 1 — Union bounding over all topics, we get 

that with probability Pr (A^i > \C^y\ > 1 — 

Now, let’s consider Pr [Ni > ^C^)]. The event flA [Ni > can be written as the 

disjoint union of events 

{D = A, Vz ^ j, A n A = 0} 

where ID) is the set of all documents, Di is the set of documents that have z as the dominating topic, 
and \Di\ > \C^fii. (i.e. all the partitions of ID into K sets of sufficiently large size). Evidently, 
if we prove a lower bound on Pr \r\f^iMi\E'\ for any such event E, it will imply a lower bound on 
Pr (A > Por any such event, consider two documents d,d' £ {A}, i-e. having 

$i$ as the dominating topic. Let $I_{d,d'}$ be an indicator variable denoting the event that $d, d'$ do not intersect 
in an additional topic. $\Pr[I_{d,d'} = 1] = 1 - o(1)$, by the independent topic inclusion property, and the events 
$I_{d,d'}$ are easily seen to be pairwise independent. Furthermore, $Var[I_{d,d'}] = o(1)$. By Chebyshev’s inequality, 

Pr ^ Td,w>\Dl-c^^ >1-1 

d.d'GDi 

If Ni = Pl[KlogK), plugging in c = AT, we get that Pr Idd' = ^[Df) > 1 — Hence, 

d,d'£Di 

Pr > 1 — A, by a union bound, which implies Pr [Ni > ^C^)] > 1 — 

Putting all of the above together, ii D = Pl[K'^\ogK), with probability 1 — all topics have 

il[D'^/K'^) identifying pairs, which is what we want. 

□ 
The lemma implies that with probability 1 — r2(-^), we will not eliminate the sets Sa,b corresponding to 
topic supports. 

We introduce the following concept of a “configuration”. A set of words $C$ will be called a “configuration” 
if it can be constructed as the intersection of the discriminative words in some set of topics, i.e. 

Definition. A set of words $C$ is called a configuration if there exists a set $I = \{I_1, \ldots, I_{|I|}\}$ of topics, s.t. 

c = 

Let’s call the minimal size of a set I that can produce C the generator size of C. 

Now, we claim the following fact: 

Lemma 19. If a configuration C has generator size > 3, then with probability 1 — it cannot appear 

as one of the sets Sa,b after step 2 in the WEEDOUT procedure. 

Proof. Since $C$ has generator size at least 3, if two documents $d, d'$ intersect in fewer than three topics, then step 1 in 
WEEDOUT cannot produce $S_{d,d'}$ which is equal to $C$. Hence, prior to step 2, $C$ can only appear as $S_{d,d'}$ for 
$d, d'$ that intersect in at least 3 topics. 

Let Id,d' be an indicator variable denoting the fact that the pair of documents d, d' intersects in at least 
3 topics. We have PT:\Zdfi' = 1] < l/K^ + l/AT^ + ... 1/K'^ = 0{1/K^) by the independent topic inclusion 
property. 

If I 3 is a variable denoting the total number of documents that intersect in at least 3 topics, again by 
Chebyshev as in Lemma 18 we get: 

Pr[l3 > Q{D/K^) - c&{Vd/K^/^)] > 1 “ ^ 

Again, by putting c = Vk, since the number of documents is K^log^ K, with probability 1 — all 
configurations with generator size > 3 cannot appear as one of the sets Safi, as we wanted. 

□ 

This means that after the WEEDOUT step, with probability 1 — D(-^), we will just have sets Safi 
corresponding to configurations generated by two topics or less. The options for these are severely limited: 
they have to be either a topic support, the union of two topic supports, or the intersection of two topic 
supports. We can handle this case fairly easily, as proven in the following lemma: 

Lemma 20. After the end of step 3, with probability 1 — the only remaining Safi are those corre¬ 

sponding to topic supports. 

Proof. First, when we check if some $S_{d,d'}$ is the union of two other sets and delete it if so, we claim we will 
delete the sets equal to configurations that correspond to unions of two topic supports (and nothing else). 
This is not that difficult to see: certainly the sets that do correspond to configurations of this type will get 
deleted. 

On the other hand, if it’s the case that $S_{a,b}$ corresponds to a single topic support, we won’t be able to 
write it as the union of two sets $S_{d,d'}$, $S_{d'',d'''}$, unless one is contained inside the other - this is ensured by 
the existence of discriminative words. 

Hence, after the first two passes, we will only be left with sets that are either topic supports, or intersections 
of two topic supports. Then, removing the non-maximal sets is easily seen to remove the sets that are 
intersections, again due to the existence of discriminative words. 

□ 


B.3.4 Finding the document supports 

Now, given the supports of each topic, for each document, we want to determine the topics which are non-zero
in it. The algorithm is given in Algorithm 5:

Lemma 21. If a topic $i_0$ is such that $\gamma^*_{d,i_0} > 0$, it will be declared "IN". If a topic $i_0$ is such that $\gamma^*_{d,i_0} = 0$,
it will be declared "OUT".

Algorithm 5 Finding document supports
Initialize $R = \emptyset$.
for each topic i do
    Compute $\mathrm{Score}(i) = \sum_{j \in \mathrm{Support}(i) \setminus R} \tilde f_{d,j}$
end for
Find $i^*$ such that $\mathrm{Score}(i^*)$ is maximum.
while $\mathrm{Score}(i^*) > 0$ do
    Output $i^*$ to be in the support of d.
    $R = R \cup \mathrm{Support}(i^*)$
    Recompute Score for every other topic.
    Find $i^*$ with maximum score.
end while
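
The greedy loop of Algorithm 5 can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `find_document_support`, the dictionary-based inputs, and the tie-breaking rule (smallest topic index first) are our assumptions and are not part of the paper.

```python
def find_document_support(freq, supports):
    """Greedy support recovery in the spirit of Algorithm 5.

    freq:     dict mapping word -> empirical frequency of the word in document d
    supports: dict mapping topic -> set of words in that topic's (known) support
    Returns the topics declared IN, in the order they were added.
    """
    covered = set()   # the set R: words of topics already declared IN
    chosen = []

    def score(topic):
        # Document mass on the topic's words that are not covered yet.
        return sum(freq.get(w, 0.0) for w in supports[topic] - covered)

    remaining = set(supports)
    while remaining:
        # sorted() makes tie-breaking deterministic (an assumption of this sketch).
        best = max(sorted(remaining), key=score)
        if score(best) <= 0:
            break
        chosen.append(best)
        covered |= supports[best]
        remaining.remove(best)
    return chosen
```

For example, with supports `{0: {"a", "b"}, 1: {"b", "c"}, 2: {"d", "e"}}` and a document whose mass sits on the words a, b, c, the loop adds topic 0, then topic 1, and stops because topic 2 has score 0.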


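Proof. Consider a topic i. At any iteration of the while cycle, consider $\sum_{j \in \mathrm{Support}(i) \setminus R} \tilde f_{d,j}$. Clearly, $\tilde f_{d,j} \ge (1 - \epsilon)\gamma^*_{d,i}\beta^*_{i,j}$, and $\sum_{j \in R} \beta^*_{i,j} = o(1)$. Hence,

$$\sum_{j \in \mathrm{Support}(i) \setminus R} \tilde f_{d,j} \ge (1 - \epsilon)\gamma^*_{d,i}(1 - o(1)) \ge \frac{1}{2}\gamma^*_{d,i}.$$

So, topic i will be added eventually.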

On the other hand, let's assume the document doesn't contain a given topic $i_0$. Let's call B the set
of words j which are in the support of $i_0$, and belong to at least one of the topics in document d. Then,
$\sum_{j \in \mathrm{Support}(i_0)} \tilde f_{d,j} = \sum_{j \in B} \tilde f_{d,j}$. Let $i^*$ be the topic which is present in the document but not added yet and has
maximum value of $\gamma^*_{d,i}$. Then

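$$\sum_{j \in B} \tilde f_{d,j} \le (1 + \epsilon) \sum_{i \in d} \sum_{j \in B} \gamma^*_{d,i}\beta^*_{i,j} \le (1 + \epsilon)\,T\,\gamma^*_{d,i^*}\,o(1) < \gamma^*_{d,i^*}(1 - o(1))$$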

Hence, topic $i^*$ will always get preference over $i_0$. Once all the topics which are present in the document
have been added, it is clear that no more topics will be added, since every remaining score will be 0.

□ 


This finally finishes the proof of Theorem 2. 


C Case study 2: Dominating topics, seeded initialization 

As a reminder, seeded initialization does the following: 

• For each topic i, the user supplies a document d, in which $\gamma^*_{d,i} \ge C_l$.

• We initialize with $\beta^0_{i,j} = f^*_{d,j}$.

The theorem we want to show is: 

Theorem 3. Given an instance of topic modelling satisfying the Case Study 2 properties specified above, where
the number of documents is […], if we initialize with seeded initialization, after $O(\log(1/\epsilon') + \log N)$
KL-tEM updates, we recover the topic-word matrix and topic proportions to multiplicative accuracy $1 + \epsilon'$.

The proof will be in a few phases again: 

• Phase I: Anchor identification: First, we will show that as long as we can identify the dominating
topic in each of the documents, the anchor words will make progress, in the sense that after $O(\log N)$
rounds, the values for the topic-word estimates will be almost zero for the topics for which
the word is not an anchor, and lower bounded for the one for which it is.

• Phase II: Discriminative word identification: Next, we show that as long as we can identify the
dominating topics in each of the documents, and the anchor words were properly identified in the
previous phase, the values of the topic-word matrix for words which do not belong to a certain topic
will keep dropping until they reach almost zero, while being lower bounded for the words that do.

• For Phase I and II above, we will need to show that the dominating topic can be identified at any step.
Here we'll leverage the fact that the dominating topic is sufficiently large, as well as the fact that the
anchor words have quite a large weight.

• Phase III: Alternating minimization: Finally, we show that after Phase I and II above, we are back to
the scenario of the previous section: namely, there is a "boosting" type of improvement in each next
round.


C.1 Estimates on the dominating topic

Before diving into the specifics of the phases above, we will show what conditions we need in order to be
able to identify the dominating topic in each of the documents. For notational convenience, let $\Delta_m$ be the
m-dimensional simplex: $x \in \Delta_m$ iff $\forall i \in [m],\ 0 \le x_i \le 1$ and $\sum_i x_i = 1$.

First, during a γ update, we are minimizing $KL(\tilde f_d \| f_d)$ with respect to the $\gamma_d$ variables, so we need some
way of arguing that whenever the β estimates are not too bad, minimizing this quantity also quantifies how
far the optimal $\gamma_d$ variables are from $\gamma^*_d$.

Formally, we’ll show the following: 

Lemma 22. If, for all i, $KL(\beta^*_i \| \beta^t_i) \le R_\beta$, and $\min_{\gamma_d \in \Delta_K} KL(\tilde f_d \| f_d) \le R_f$, after running a KL divergence
minimization step with respect to the $\gamma_d$ variables, we get that $\|\gamma^*_d - \gamma_d\|_1 \le \frac{1}{p}\left(\sqrt{\tfrac{1}{2}R_\beta} + \sqrt{\tfrac{1}{2}R_f}\right) + \epsilon$.

We will start with the following simple helper claim:

Lemma 23. If the word-topic matrix $\beta^*$ is such that in each topic the anchor words have total probability at
least p, then $\|\beta^* v\|_1 \ge p\|v\|_1$.

Proof. 


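Lemma 24. If, for all i, $KL(\beta^*_i \| \beta^t_i) \le R_\beta$, and $\min_{\gamma_d \in \Delta_K} KL(\tilde f_d \| f_d) \le R_f$, after running a KL divergence
minimization step with respect to the $\gamma_d$ variables, we get that $\|\gamma^*_d - \gamma_d\|_1 \le \frac{1}{p}\left(\sqrt{\tfrac{1}{2}R_\beta} + \sqrt{\tfrac{1}{2}R_f}\right) + \epsilon$.

Proof. First, observe that since $\min_{\gamma_d \in \Delta_K} KL(\tilde f_d \| f_d) \le R_f$, at the optimal $\gamma_d$ we have that $\|\tilde f_d - f_d\|_1^2 \le \tfrac{1}{2}R_f$,
i.e. $\|\tilde f_d - f_d\|_1 \le \sqrt{\tfrac{1}{2}R_f}$, by Pinsker's inequality.

We will show that if $\|\gamma^*_d - \gamma_d\|_1$ is large, so must be $\|\tilde f_d - f_d\|_1$, and hence $KL(\tilde f_d \| f_d)$, which will
contradict the above upper bound.

Let's consider $\beta^*$ as an N by K matrix, $\gamma^*$ as a K-dimensional vector, and $f^*$ as an N-dimensional vector. Let $\beta^*\gamma^*$ just denote
matrix-vector multiplication, so $f^* = \beta^*\gamma^*$. For any other vector γ, let's denote $f = \beta^t\gamma$. Then:

$$\|\tilde f - f\|_1 = \|\tilde f - \beta^t\gamma\|_1 = \|\tilde f - (\beta^* + (\beta^t - \beta^*))\gamma\|_1 \ge \|\tilde f - \beta^*\gamma\|_1 - \|(\beta^t - \beta^*)\gamma\|_1 \quad (C.1)$$

Hence, $\|\tilde f - \beta^*\gamma\|_1 \le \|(\beta^t - \beta^*)\gamma\|_1 + \|\tilde f - f\|_1$. However,

$$\|(\beta^t - \beta^*)\gamma\|_1 \le \max_i \|\beta^t_i - \beta^*_i\|_1 \le \max_i \sqrt{\tfrac{1}{2}KL(\beta^*_i \| \beta^t_i)} \quad (C.2)$$

The first inequality is a property of induced matrix norms, the second is via Pinsker's inequality.

So, by C.1 and C.2, $\|\tilde f - \beta^*\gamma\|_1 \le \sqrt{\tfrac{1}{2}R_\beta} + \sqrt{\tfrac{1}{2}R_f}$. But now, finally, Lemma 23 implies that
$\|\gamma^*_d - \gamma_d\|_1 \le \frac{1}{p}\left(\sqrt{\tfrac{1}{2}R_\beta} + \sqrt{\tfrac{1}{2}R_f}\right) + \epsilon$.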

□ 
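
Lemma 23 is stated without a written proof here, so a quick numerical sanity check may be useful. The sketch below builds random topic-word matrices in which every topic has a single anchor word of mass p (a simplifying assumption of ours; the lemma allows several anchors per topic) and checks $\|\beta v\|_1 \ge p\|v\|_1$ on random signed vectors. All function names are hypothetical.

```python
import random

def l1(v):
    return sum(abs(x) for x in v)

def matvec(beta, v):
    # beta[j][i]: probability of word j under topic i (N rows, K columns).
    return [sum(row[i] * v[i] for i in range(len(v))) for row in beta]

def random_anchor_matrix(n_shared, k, p, rng):
    """Topic-word matrix with one anchor word per topic carrying mass p.
    The remaining 1 - p of each topic is spread over n_shared shared words."""
    beta = [[0.0] * k for _ in range(k + n_shared)]
    for i in range(k):
        beta[i][i] = p                      # word i is an anchor of topic i only
        weights = [rng.random() for _ in range(n_shared)]
        total = sum(weights)
        for j, w in enumerate(weights):
            beta[k + j][i] = (1.0 - p) * w / total
    return beta

def check_lemma23(trials=200, p=0.3, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        beta = random_anchor_matrix(n_shared=5, k=4, p=p, rng=rng)
        v = [rng.uniform(-1.0, 1.0) for _ in range(4)]
        if l1(matvec(beta, v)) < p * l1(v) - 1e-12:
            return False
    return True
```

The check succeeds for a simple reason: each anchor row contributes exactly $p\,|v_i|$ to the norm, and the shared rows can only add to it.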

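Lemma 25. Suppose that for the dominating topic i in a document d, $\gamma^*_{d,i} \ge C_l$, and for all other topics
i', $\gamma^*_{d,i'} \le C_s$, s.t. $C_l - C_s \ge \frac{1}{p}\left(\sqrt{\tfrac{1}{2}R_f} + \sqrt{\tfrac{1}{2}R_\beta}\right) + \epsilon$. Then, the above test identifies the largest topic.
Furthermore, $\frac{1}{2}\gamma^*_{d,i} \le \gamma^t_{d,i} \le \frac{3}{2}\gamma^*_{d,i}$.

Proof. By Lemma 24, and the relationship between $\ell_1$ and total variation distance between distributions, we
have that $|\gamma^t_{d,i} - \gamma^*_{d,i}| \le \frac{1}{2}\left(\frac{1}{p}\left(\sqrt{\tfrac{1}{2}R_f} + \sqrt{\tfrac{1}{2}R_\beta}\right) + \epsilon\right)$ for every topic i.

For the dominating topic i, $\gamma^t_{d,i} \ge C_l - \frac{1}{2}\left(\frac{1}{p}\left(\sqrt{\tfrac{1}{2}R_f} + \sqrt{\tfrac{1}{2}R_\beta}\right) + \epsilon\right)$. On the other hand, for any other
topic i', $\gamma^t_{d,i'} \le C_s + \frac{1}{2}\left(\frac{1}{p}\left(\sqrt{\tfrac{1}{2}R_f} + \sqrt{\tfrac{1}{2}R_\beta}\right) + \epsilon\right)$. Since $C_l - C_s \ge \frac{1}{p}\left(\sqrt{\tfrac{1}{2}R_f} + \sqrt{\tfrac{1}{2}R_\beta}\right) + \epsilon$, we get $\gamma^t_{d,i} \ge \gamma^t_{d,i'}$, so
the test works.

On the other hand, $\gamma^t_{d,i} \ge \gamma^*_{d,i} - \frac{1}{2}\left(\frac{1}{p}\left(\sqrt{\tfrac{1}{2}R_f} + \sqrt{\tfrac{1}{2}R_\beta}\right) + \epsilon\right) \ge \gamma^*_{d,i} - \frac{1}{2}\gamma^*_{d,i} = \frac{1}{2}\gamma^*_{d,i}$. Similarly,
$\gamma^t_{d,i} \le \gamma^*_{d,i} + \frac{1}{2}\left(\frac{1}{p}\left(\sqrt{\tfrac{1}{2}R_f} + \sqrt{\tfrac{1}{2}R_\beta}\right) + \epsilon\right) \le \gamma^*_{d,i} + \frac{1}{2}\gamma^*_{d,i} = \frac{3}{2}\gamma^*_{d,i}$.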

□ 
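
To make the margin condition of Lemma 25 concrete, here is a small sketch. It assumes the test in question simply picks the topic with the largest estimated proportion, and that the slack has the form $\frac{1}{p}(\sqrt{R_f/2} + \sqrt{R_\beta/2}) + \epsilon$ as in the lemma. The function names are ours and purely illustrative.

```python
import math

def dominating_topic(gamma_est):
    # Assumed form of the test: the index of the largest estimated proportion.
    return max(range(len(gamma_est)), key=lambda i: gamma_est[i])

def estimation_slack(r_f, r_beta, p, eps):
    # l1 error bound on the estimated proportions, as in Lemma 24.
    return (math.sqrt(0.5 * r_f) + math.sqrt(0.5 * r_beta)) / p + eps

def margin_is_sufficient(c_l, c_s, r_f, r_beta, p, eps):
    # Lemma 25 needs the gap between large and small topics to cover the slack.
    return c_l - c_s >= estimation_slack(r_f, r_beta, p, eps)
```

With $R_f = R_\beta = 0.005$, $p = 0.5$ and $\epsilon = 0.01$ the slack is 0.21, so a gap $C_l - C_s = 0.4$ suffices while a gap of 0.1 does not.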


C.2 Phase I: Determining the anchor words

We proceed as outlined. In this section we show that in the first phase of the algorithm, the anchor words
will be identified. By this we mean that we will be able to show that if a word j is an anchor for topic i,
$\beta^t_{i,j}$ will be within a factor of roughly 2 from $\beta^*_{i,j}$, and $\beta^t_{i',j}$ will be almost 0 for any other topic i'.

We will assume throughout this and the next section that we can identify what the dominating topic is,
and that we have an estimate of the proportion of the dominating topic to within a factor of 2. (We won't
restate this assumption in all the lemmas in favor of readability.)

We will return to this issue after we've proven the claims of Phases I and II modulo this claim.

The outline is the following. We show that at any point in time, by virtue of the initialization, $\beta^t_{i,j}$ is
pretty well lower bounded (more precisely, it's at least a constant times $\beta^*_{i,j}$). This enables us to show that
$\beta^t_{i',j}$ will halve at each iteration, so in some polynomial number of iterations it will be basically 0.


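C.2.1 Lower bounds on the $\beta^t_{i,j}$ values

We proceed as outlined above. We show here that the $\beta^t_{i,j}$ variables are lower bounded at any point in time.
More precisely, we show the following lemma:

Lemma 26. Let j be an anchor word for topic i, and let $i' \ne i$. Suppose that $\beta^t_{i',j} \le \beta^t_{i,j}$. Then, $\beta^{t+1}_{i,j} \ge (1 - \epsilon)C_l\beta^*_{i,j}$ holds.

Proof. We'll prove a lower bound on each of the terms $\frac{\tilde f_{d,j}}{f^t_{d,j}}\beta^t_{i,j}$. Since the update on the β variables is a
convex combination of terms of this type, this will imply a lower bound on $\beta^{t+1}_{i,j}$.
For this, we upper bound $f^t_{d,j}$. We have:

$$f^t_{d,j} = \beta^t_{i,j}\gamma^t_{d,i} + \sum_{i' \ne i}\beta^t_{i',j}\gamma^t_{d,i'}$$

This means that $f^t_{d,j}$ is a convex combination of terms, each of which is at most $\beta^t_{i,j}$. Hence, $f^t_{d,j} \le \beta^t_{i,j}$
holds. But then $\frac{\tilde f_{d,j}}{f^t_{d,j}}\beta^t_{i,j} \ge \tilde f_{d,j} \ge (1 - \epsilon)\beta^*_{i,j}\gamma^*_{d,i} \ge (1 - \epsilon)C_l\beta^*_{i,j}$. This implies $\beta^{t+1}_{i,j} \ge (1 - \epsilon)C_l\beta^*_{i,j}$, as we
wanted.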

□ 
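
The proof above treats the β update as $\beta^t_{i,j}$ times a convex combination of the ratios $\tilde f_{d,j}/f^t_{d,j}$, weighted by the current proportions. A minimal sketch of one such multiplicative step follows. It assumes the update has the form $\beta^{t+1}_{i,j} = \beta^t_{i,j}\,\frac{\sum_d (\tilde f_{d,j}/f^t_{d,j})\,\gamma^t_{d,i}}{\sum_d \gamma^t_{d,i}}$, which is inferred from how the proofs in this appendix describe it. The function name and the list-of-lists layout are ours.

```python
def beta_update(beta, gamma, f_emp):
    """One multiplicative topic-word update (assumed form, see lead-in).

    beta[i][j]:  current estimate of word j's probability under topic i
    gamma[d][i]: current proportion of topic i in document d
    f_emp[d][j]: empirical frequency of word j in document d
    """
    K, N, D = len(beta), len(beta[0]), len(gamma)
    # Model frequencies f[d][j] = sum_i gamma[d][i] * beta[i][j].
    f = [[sum(gamma[d][i] * beta[i][j] for i in range(K)) for j in range(N)]
         for d in range(D)]
    new_beta = [row[:] for row in beta]
    for i in range(K):
        weight = sum(gamma[d][i] for d in range(D))
        if weight == 0.0:
            continue  # topic absent from every document: leave its row unchanged
        for j in range(N):
            # Convex combination of the ratios f_emp/f, weighted by gamma[d][i].
            ratio = sum(gamma[d][i] * f_emp[d][j] / f[d][j]
                        for d in range(D) if f[d][j] > 0.0)
            new_beta[i][j] = beta[i][j] * ratio / weight
    return new_beta
```

Two properties are easy to read off: if the model already reproduces the empirical frequencies, every ratio is 1 and β is a fixed point; and if a word never occurs in any document, its estimate drops to 0 in one step.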
C.2.2 Decreasing $\beta^t_{i',j}$ values


We'll bootstrap from the above result. Namely, we'll prove that whenever $\beta^t_{i,j} \ge \frac{1}{C_\beta}\beta^*_{i,j}$ for some constant $C_\beta$,
the $\beta^t_{i',j}$ values decrease multiplicatively at each round. Prior to doing that, the following lemma is useful.
It states that whenever the values of the variables $\beta^t_{i',j}$ are somewhat small, we can get some reasonable
lower bound on the values $\gamma^t_{d,i}$ we get after a step of KL minimization with respect to the γ variables.

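Lemma 27. Let j be an anchor for topic i, and let $i' \ne i$. Let $\beta^t_{i',j} \le b\,\beta^t_{i,j}$. Then, for any document d,
when performing KL divergence minimization with respect to the variables $\gamma_d$, for the optimum value $\gamma^t_{d,i}$ it
holds that $\gamma^t_{d,i} \ge (1 - \epsilon)\frac{p}{1-b}\gamma^*_{d,i} - \frac{b}{1-b}$.

Proof. The KKT conditions B.1 imply that, if we denote by $A_i$ the set of anchors in topic i, $\sum_{j \in A_i} \frac{\tilde f_{d,j}}{f^t_{d,j}}\beta^t_{i,j} \le 1$.

By the assumption of the lemma,

$$f^t_{d,j} \le \beta^t_{i,j}\gamma^t_{d,i} + b\,\beta^t_{i,j}(1 - \gamma^t_{d,i})$$

Since $\tilde f_{d,j} \ge (1 - \epsilon)\beta^*_{i,j}\gamma^*_{d,i}$, this implies $\frac{\tilde f_{d,j}}{f^t_{d,j}}\beta^t_{i,j} \ge (1 - \epsilon)\beta^*_{i,j}\frac{\gamma^*_{d,i}}{\gamma^t_{d,i}(1-b) + b}$, i.e.

$$\sum_{j \in A_i}(1 - \epsilon)\beta^*_{i,j}\frac{\gamma^*_{d,i}}{\gamma^t_{d,i}(1-b) + b} \le 1.$$

Rearranging the terms, we get

$$\gamma^t_{d,i} \ge (1 - \epsilon)\sum_{j \in A_i}\beta^*_{i,j}\frac{\gamma^*_{d,i}}{1-b} - \frac{b}{1-b} \ge (1 - \epsilon)\frac{p}{1-b}\gamma^*_{d,i} - \frac{b}{1-b}$$

as we needed.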


□ 
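
The γ step analysed in Lemma 27 is a convex minimization of $KL(\tilde f_d \| f_d)$ over the simplex. This section does not say how the minimization is carried out, so the sketch below assumes a standard EM-style multiplicative solver. Its fixed points satisfy $\sum_j \frac{\tilde f_{d,j}}{f_{d,j}}\beta_{i,j} = 1$ for every topic with $\gamma_{d,i} > 0$, which is the KKT condition the proof uses. The function name and iteration count are ours.

```python
def gamma_step(beta, f_emp, iters=2000):
    """Approximately minimise KL(f_emp || f) over the simplex, where
    f[j] = sum_i gamma[i] * beta[i][j], using multiplicative updates.

    beta[i][j]: probability of word j under topic i (each row sums to 1)
    f_emp[j]:   empirical word frequencies of one document (sums to 1)
    """
    K, N = len(beta), len(f_emp)
    gamma = [1.0 / K] * K                     # start at the centre of the simplex
    for _ in range(iters):
        f = [sum(gamma[i] * beta[i][j] for i in range(K)) for j in range(N)]
        gamma = [
            gamma[i] * sum(f_emp[j] * beta[i][j] / f[j]
                           for j in range(N) if f[j] > 0.0)
            for i in range(K)
        ]
    return gamma
```

On a toy instance with two topics that each own an anchor word, feeding in the exact frequencies generated by proportions (0.7, 0.3) recovers those proportions, and a topic that is absent from the document is driven to (numerically) zero.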


With this in place, we show that the value $\beta^t_{i',j}$, when j is an anchor for topic $i \ne i'$, decreases by a factor
of 2 after the update for the β variables.

This requires one more new idea. Intuitively, if we view the update as setting $\beta^{t+1}_{i',j}$ to $\beta^t_{i',j}$ multiplied by
a convex combination of terms $\frac{\tilde f_{d,j}}{f^t_{d,j}}$, a large number of them will be zero, just because $\tilde f_{d,j} = 0$ unless topic
i belongs to document d.

By the topic equidistribution property then, the probability that this happens is only O(1/K), so if the
weight in the convex combination on these terms is reasonable, we will multiply $\beta^t_{i',j}$ by something less than
1, which is what we need.

Lemma 27 says that if $\gamma^*_{d,i}$ is reasonably large, we will estimate it somewhat decently. If $\gamma^*_{d,i}$ is small,
then $\tilde f_{d,j}$ would be small anyway.

So we proceed according to this idea.

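Lemma 28. Let j be an anchor for topic i. Let $\beta^t_{i',j} \le b\,\beta^t_{i,j}$ for $i' \ne i$, and let $\beta^t_{i,j} \ge \frac{1}{C_\beta}\beta^*_{i,j}$ for some
constant $C_\beta$. Then, $\beta^{t+1}_{i',j} \le \frac{b}{2}\beta^t_{i,j}$.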
Proof. We will split the P update as 



for some appropriately chosen partition of the documents into three groups $D_1$, $D_2$, $D_3$.

Let Di be documents which do not contain topic i at all, D 2 documents which do contain topic i, and 
ld,i ^ £>3 documents which do contain topic i and 7 ^ < y • 

The first part will just vanish because word j is an anchor word for topic i, and topic i does not appear
in these documents, so $\tilde f_{d,j} = 0$ for all documents $d \in D_1$.


The second summand we will upper bound as follows. First, we upper bound We have that 

fd,j ^ l^ljLd,i — However, we can use Lemma 27 to lower bound 7X- We have that 7 ^ > 

(1 - e)(T^7d.i - T^) > (1 - e)2(i^7d.i- This alltogether implies ^ Hence, 



1 ‘iCp 
1 — e p 


(1 - m',j 
Furthermore, $\sum_{d \in D}\gamma^t_{d,i'} \ge \frac{1}{2}|D|C_l$. On the other hand, I claim $\sum_{d \in D_2}\gamma^t_{d,i'} = O(|D|/K)$. Recall that D is the
set of documents where topic i' is the dominating topic, so by definition they contain topic i'. On the other
hand, if a document is in $D_2$ then it contains topic i as well. However, by the independent topic inclusion
property, the probability that a document with dominating topic i' contains topic i as well is O(1/K). Hence,


E 


fd,: 

d€D2 JY,. 






T.did,i 




For the third summand we provide a trivial bound for the terms j/: 


fdj 

fL 


< (1 + ^)Pi,jld,i < (1 + 


P 


Since again, — WD\Ci, and again, the number of document in is at most 0(1/Rr) for the same 

reasons as before, we have that 



< 0{llK)bPl^ 


0{llK)hpl^ 


since 


* 


From the above three bounds, we get that < 0(1/K)bl3l j < 



□ 


Now, we just have to put together the previous two claims: namely we need to show that the conditions 
for the decay of the non-anchor topic values, and the lower bound on the anchor-topic values are actually 
preserved during the iterations. We will hence show the following: 

Lemma 29. Suppose we initialize with seeded initialization. Then, after t rounds, if j is an anchor word
for topic i, $\beta^t_{i,j} \ge (1 - \epsilon)C_l\beta^*_{i,j}$, and $\beta^t_{i',j} \le 2^{-t}C_s\beta^*_{i,j}$.

Proof. We prove this by induction.

Let's cover the base case first. In the seed document corresponding to topic i, $\gamma^*_{d,i} \ge C_l$, so at initialization
$\beta^0_{i,j} \ge C_l\beta^*_{i,j}$. On the other hand, if topic i appears in the seed document for topic i', then after initialization
$\beta^0_{i',j} \le C_s\beta^*_{i,j} \le \beta^0_{i,j}$. Hence, at initialization, the claim is true.

On to the induction step. If the claim were true at time step t, since $\beta^t_{i',j} \le 2^{-t}C_s\beta^*_{i,j}$, by Lemma 26,
$\beta^{t+1}_{i,j} \ge (1 - \epsilon)C_l\beta^*_{i,j}$, so the lower bound still holds at time t + 1. On the other hand, since $\beta^t_{i,j} \ge (1 - \epsilon)C_l\beta^*_{i,j}$, by
Lemma 28, at time t + 1, $\beta^{t+1}_{i',j} \le 2^{-(t+1)}C_s\beta^*_{i,j}$.

Hence, the claim we want follows.

□ 


Finally, we show the easy lemma that after the values Plij have decreased to (almost) 0, Plj > ^Pfj- 
Lemma 30. Let word j be an anchor word for topic i. Suppose Pit ^ < 2~^CsP*j and 

t > 10 max(log(A^),log(^^),log(-^)) 

7min P min 

Then API, > Piy > \Ply 

Proof. Let us do the lower bound first. It’s easy to see Pl, jjd,i' < jld i- Hence, 


—P* lit ■ = 

ft t't.J td,i 

^d,j 


fd,j 


Ez' 


-Pi till > 
Hence, after the update,


f fd,j nt t \ /I \ ^ Q* * 

9 at ^ (1 ~ 

^i,jid,i 

«‘+i > (1 - e)-B* > -B* 

since 7 * _, < 27 * ^. 

The upper bound is similar. Since 

Jd. i 


Hence, 


AT <(l + e)/3-. 


Sd i*d,i 

Ed7L 


< Wl 


since 7 ^ ^ This certainly implies the claim we want. 


Furthermore, the following simple application of Lemma 27 is immediate and useful:
Lemma 31. Let $t > 10\max\left(\log N, \log\frac{1}{\gamma^*_{\min}}, \log\frac{1}{\beta^*_{\min}}\right)$. Then, $\gamma^t_{d,i} \ge \frac{p}{2}\gamma^*_{d,i}$.


□ 


C.3 Discriminative words 

We established in the previous section that after a logarithmic number of steps, the anchor words will be
correctly identified, and estimated within a factor of 2. We show that this is enough to cause the support of
the discriminative words to be correctly identified too, as well as to estimate them to within a constant factor
where they are non-zero.

Same as before, we will assume in this section that we can identify the dominating topic.

We will crucially rely on the fact that the discriminative words will not have a very large dynamic range
compared to their total probability mass in a topic. The high level outline will be similar to the case
for the anchor words. We will prove that if a discriminative word j is in the support of topic i, then $\beta^t_{i,j}$ will
always be reasonably lower bounded, and this will cause the values $\beta^t_{i',j}$ to keep decaying for the topics i'
that the word j does not belong to.

The reason we will need the bound on the dynamic range, and the proportion of the dominating topic,
and the size of the dominating topic, is to ensure that the β's are always properly lower bounded.


C.3.1 Bounds on the $\beta^t_{i,j}$ values

First, we show that because the discriminative words have a small range, the values $\beta^t_{i,j}$, whenever $\beta^*_{i,j}$ is
non-zero, are always maintained to be within some multiplicative constant (which depends on the range of
the $\beta^*_{i,j}$).

As a preliminary, notice that having identified the anchor words correctly, the γ values are appropriately
lower bounded after running the γ update. Namely, by Lemma 31, $\gamma^t_{d,i} \ge \frac{p}{2}\gamma^*_{d,i}$.

With this in hand, we show that the $\beta^t_{i,j}$ values are well upper bounded whenever $\beta^*_{i,j}$ is non-zero.

Lemma 32. At any point in time t, Bl j < (1 + y 

Proof. Since ^Bhldi < fdj we have: 


/3 


t+i 


< 


Sd fd,j 

T.dldd 


< 2 - 


'Yhd fd.,j 

Ed^h 
On the other hand, we claim that fdj < (1 + ^)Bl3*j. Indeed, fdj < (1 + e) 7d other

topic i', P*, , < Bp*A. Hence, 

2 Edk, . 2{l + e)DBPl^ 

Edih ~ 

However, since — previous expression is at most 

2(1+ e)DBPl^ 2(l + e)B 

DCi Cl 

So, we get the claim we wanted. 

□ 


The lower bound on the pC values is a bit more involved. To show a lower bound on the pC values is 
maintained, we will make use of both the fact that the discriminative words have a small range, and that 
we have some small, but reasonable proportion of documents where 7^ ^ ~ More precisely, we show: 

Lemma 33. Let Plj < topics i that word j belongs to, and let Plj > %Pij- Then, 

Pk" > %Pl, as well. 

Proof. Let’s call Ds the documents where 'ydi — ^~^- certainly lower bound 

id,i 

First, let’s focus on k^pC. Then, 

Id i 


kj > (1 - e)(l - S)Pl^ 

Furthermore, since J2deDs 7^ > ^ Edeu, k^ Ed^k ^ ‘^T.dlk 

Finally, we claim that 7^ > i. Massaging this inequality a bit, we get it’s equivalent to: 

Pi j 1 

P ~ 2 
fP < 2pk ^ 

ld,iPi,j + 'k2k,i'Pi',j ^ 2 / 3 -j 

i' 

The left hand side can be upper bounded by 

-v* 4- C ^(1 + *^)B^ ot 

^d,iPi,j / . ^d,i' ^2 Pi 


(C.3) 


(C.4) 


7< 


ld,iPi,j + (1 ~ ld,i)-^ 


Ct 




by the assumptions of the lemma. 

So, it is sufficient to show that 7 ^ ^Pjj + (1 — 7 ^ Plj ^ “^Ptj^ however this is equivalent after 

some rearrangement to 7 ^ > 1 ~ — • 

- :=7T - 1 
It’s certainly sufficient for this that 7 ^ > 1 — = 1 — -gr, but since since 7 ^ ■ > 1 — i5, by the definition

of (5 and Lemmas 24, 35, 36, this certainly holds. 

Together with C.4 and C.3, we get that 

/3‘y > (1 - e)|(l - > (1 - 

But, by our assumptions, (1 — e)(l — 6)“^ > Ci, so the claim follows. 

□ 


C.3.2 Decreasing $\beta^t_{i',j}$ values

Finally, we show that if the discriminative word j does not belong in topic i', the value of $\beta^t_{i',j}$ will keep
dropping. More precisely, the following is true:

Lemma 34. Let word j and topic i' be such that $\beta^*_{i',j} = 0$ and let $\beta^t_{i',j} \le b$. Furthermore, for all the
topics i that j belongs to, let $\beta^t_{i,j} \ge \frac{1}{C_\beta}\beta^*_{i,j}$ hold for some constant $C_\beta$. Finally, let $\gamma^t_{d,i} \ge \frac{1}{C_\gamma}\gamma^*_{d,i}$ for some
constant $C_\gamma$. Then, $\beta^{t+1}_{i',j} \le b/2$.

Proof. We proceed similarly as the analogous claim for anchor words. We split the update as 




E 


fd,: 


d€Di 


d,j 


Td,^ 


E 


fd,‘ 
dGD2 fl^ 




T,dTi^' 


Ed lii 


for some appropriate partitioning of the documents $D_1$, $D_2$.

Namely, let $D_1$ be the documents which do not contain any topic to which word j belongs, and $D_2$ the documents
which contain at least one topic word j belongs to.

For all the documents in $D_1$, $\tilde f_{d,j} = 0$, and we will provide a good bound for the terms in $D_2$. This
way, we'll ensure $\beta^t_{i',j}$ gets multiplied by a quantity which is o(1) to get $\beta^{t+1}_{i',j}$, which is of course enough for
what we want.

Bounding the terms in $D_2$ is even simpler than before. We have:




Cpa 




Cpa 


-fl 


d,j 


Hence, < CpC^. 

^ d,j 

Then we have: 


^d f^, , !d,i 


d,j 


T,d"fL 


< (l + e)- 


E, 




< 


4(1+ e) 


Edf^7d,, 

'' d,j 


Ed^h 


< 4(1+ e) 


Ed tE 

EdeD2 ^h^'l"fd,i 


Edih 


E Id 

But now, by the ’’weak topic correlation” property, = o(l). Indeed, D consists of the documents 

where i' is the dominating topic. In order for the document to belong to D 2 , at least one of the topics word 
j belongs to must belong in the document as well. Since the word j only belongs to o{K) of the topics, and 
each document contains only a constant number of topics, by the small topic correlation property, the claim 
we want follows. 

But then, clearly, qElfSa —= o(l) as well. 

'^d,i 

Hence, jSlfj = o(l)/3‘, ^ < ^(31, ^ which is what we need. □ 
C.4 Determining dominant topic and parameter range

To complete the proofs of the claims for Phase I and II, we need to show that at any point in time we 
correctly identify the dominant topic. Furthermore, in order to maintain the lower bounds on the estimates 
for the discriminative words, we will need to make sure that 7 ^ ^ is large as well in the documents where 

ih ^ 1 

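Let's proceed to the problem of detecting the largest topic first. By Lemma 25 all we need to do is bound
$R_f$ and $R_\beta$ at any point in time during this phase. To do this, let's show the following lemma:

Lemma 35. Suppose for the anchor words $\beta^t_{i,j} \ge C_1\beta^*_{i,j}$, and for the discriminative words $\beta^t_{i,j} \ge C_2\beta^*_{i,j}$. Let $p_i$
be the proportion of anchor words in topic i. Then, $KL(\beta^*_i \| \beta^t_i) \le p_i\log\left(\frac{1}{C_1}\right) + (1 - p_i)\log\left(\frac{1}{C_2}\right)$.

Proof. This is quite simple. Since log is an increasing function,

$$KL(\beta^*_i \| \beta^t_i) = \sum_j \beta^*_{i,j}\log\frac{\beta^*_{i,j}}{\beta^t_{i,j}} \le p_i\log\left(\frac{1}{C_1}\right) + (1 - p_i)\log\left(\frac{1}{C_2}\right)$$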


□ 
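
The bound of Lemma 35 is easy to evaluate numerically. The sketch below computes the KL divergence between a true topic and an estimate, and compares it with $p_i\log(1/C_1) + (1 - p_i)\log(1/C_2)$. The helper names are hypothetical, and the numbers in the usage note are a made-up toy topic, not data from the paper.

```python
import math

def kl(p, q):
    # KL(p || q) for discrete distributions; terms with p[j] == 0 contribute 0.
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q) if pj > 0.0)

def lemma35_bound(p_anchor, c1, c2):
    # p_anchor: total mass of anchor words in the topic;
    # c1, c2: multiplicative lower bounds on anchor / discriminative estimates.
    return p_anchor * math.log(1.0 / c1) + (1.0 - p_anchor) * math.log(1.0 / c2)
```

For a topic $\beta^* = (0.4, 0.3, 0.3)$ whose first word is the anchor, and an estimate $(0.3, 0.35, 0.35)$, we may take $C_1 = 0.75$ and $C_2 = 1$. The bound is $0.4\log(4/3) \approx 0.115$, while the actual divergence is smaller, as the lemma predicts.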

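Lemma 36. Suppose for the anchor words $\beta^t_{i,j} \ge C_1\beta^*_{i,j}$, and for the discriminative words $\beta^t_{i,j} \ge C_2\beta^*_{i,j}$. Let $p_i$ be
the proportion of anchor words in topic i. Then, $\min_{\gamma_d \in \Delta_K} KL(\tilde f_d \| f_d) \le \log(1 + \epsilon) + p\log\left(\frac{1}{C_1}\right) + (1 - p)\log\left(\frac{1}{C_2}\right)$.

Proof. Also simple. The value of $KL(\tilde f_d \| f_d)$ one gets by plugging in $\gamma_d = \gamma^*_d$ is exactly what is stated in the
lemma.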

□ 


We’ll just use the above two lemmas combined from our estimates from before. We know, for all the 
anchor words, that Plj > CijSfy and that for the discriminative words, Hence, by Lemma 35, 

at any point in time iLL(/3*||/3|) < plog(^) + (1 — p) log(^). So, by Lemma 25, it’s enough that 


Cl - C. > 


i ^^2 ^plog(^) + (1-p) log(HC'i)^ + Vlog(l + e)^ + ( 


Since ^^2 (^p\og{-^) + (1 - p) log(SC'i)) < ^^2 (log(^) + (I -p)logH^, to get a sense of the pa¬ 
rameters one can achieve, for detecting the dominant topic, (ignoring e contributions), it’s sufficient that 
Ci-Cs> I Y/max(log(^), (1 - p) logH) 

If one thinks of Q as I — p and p > 1 — , since log( ^) « p roughly we want that Ci — Csi>^ ydy. (One 

takeaway message here is that the weight we require to have on the anchors depends only logarithmically on 
the range B.) 

Let’s finally figure out what the topic proportions must be in the ’’heavy” documents. In these, we want 
ld,i > 1 “ ^ \ ^^2 (^plog(^) -I- (I -p) log(i?C'i)^ - y'log(I -I- e)^ -I- e. A similar approximation to the 

above gives that we roughly want 7 ^ ^ > 1 — -b |v^- 


C.5 Getting the supports correct 

At the end of the previous section, we argued that after $O(\log N)$ rounds, we will identify the anchor words
correctly, and the supports of the discriminative words as well. Furthermore, we will also have estimated
the values of the non-zero discriminative word probabilities, as well as the anchor word probabilities, up to a
multiplicative constant. Then, I claim that from this point onward, at each of the γ steps, the γ values we
get will have the correct support. Namely, the following is true:
Lemma 37. Suppose for the anchor words and discriminative words j, if /3*j = 0, it’s true that = o(^).
Furthermore, suppose that if Pfj 0, < Plj < some constant Cp. 

Then, when performing KL minimization with respect to the γ variables, whenever $\gamma^*_{d,i} = 0$ we have
$\gamma^t_{d,i} = 0$.

Proof. Let 7 ^ ^ = 0. If 7 ^ i 7 ^ 0, then the KKT conditions imply: 


N 


ft 


f 

j=l -><1,3 


(C.5) 


The only terms that are non-zero in the above summation are due to words j that belong to at least one 
topic i' in the document. Let I be the set of words that belong to topic i as well. 

By Lemma 31, we know that 7 ^^^ > p/2jl^ Since also fli > Since Pjj = o(i) for 


words j not in the support of topic I, ^ j = o(l). 


_ ft 

3ti 


On the other hand, for words in /, ^/3f < (l-|-e)^^/3* , so 


fd 

intersection property. 

However, this contradicts C.5, so we get what we want. 


-fT^Pi j = 0 ( 1 ), by the small support 

Jd.-i 


□ 


This means that after this phase, we will always correctly identify the supports of the γ variables as well.


C.6 Alternating minimization 

Now, finishing the proof of Theorem 3 is trivial. Namely, because of Lemmas 37, 29, and the analogue of 29,
we are basically back to the case where we have the correct supports for both the β and γ variables. The
only thing left to deal with is the fact that the β variables are not quite zero.

Let j be an anchor word for topic i. Let e" = 1 — (1 — Similarly as in Lemma 31, for 

t > 10 max(log N, log( „ \ ), log( „ \ )) 

^ "^min ^ '^min 


it holds that > (1 — ■ The same inequality is true if j is a lone word for topic i in document 

Jd,j Pd,i~td,i 

d. 

After the above event, the same proof from Case Study 1 implies that after 0(log(^)) iterations we’ll 


get 


1 


and 


-h e‘ 

1 

T+e^ 

This finishes the proof of Theorem 3. 


- 0 * ■ < 0 ^ ■ 


“O'* - < O'* - 


<i^ + e')0lj 
^ (1 + 


D Justification of prior assumptions 

In this section we provide a brief motivation for our choice of properties on the topic model instances we are 
looking at. Nothing in the other sections crucially depends on this section, so it can be freely skipped upon 
first reading. 

Most of our properties on the topic priors are inspired by what happens with the Dirichlet prior.
Specifically, variants of all of the "weak correlations" between topics hold for Dirichlet. Essentially the only
difference between our assumptions and Dirichlet is the lack of smoothness. (Dirichlet is sparse, but only in
the sense that it leads to a few "large" topics, while the other topics may be non-negligible as well.)

To the best of our knowledge, the lemmas proven here were not derived elsewhere, so we include them
for completeness.

For all of the claims below, we will be concerned with the following scenario:

$\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_K)$ will be a vector of variables, and $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_K)$ a vector of parameters. We
will let γ be distributed as $\gamma := \mathrm{Dir}(\alpha_1, \alpha_2, \ldots, \alpha_K)$, where $\alpha_i = C_i/K^c$, for some constants $C_i$ and $c > 1$.

D.1 Sparsity

To characterize the sparsity of the topic proportions in a document, we will need the following lemma from
(Telgarsky, 2013):

Lemma 38. (Telgarsky, 2013) For a Dirichlet distribution with parameters $(C_1/k^c, C_2/k^c, \ldots, C_k/k^c)$, the
probability that there are more than $c_0 \ln k$ coordinates in the Dirichlet draw that are $\ge 1/k^{c_0}$ is at most
$1/k^{c_0}$.

It's clear how this is related to our assumption: if one considers the coordinates $\ge 1/k^{c_0}$ as "large", we
assume, in a similar way, that there are only a few "large" coordinates. The difference is that we want the
rest of the coordinates to be exactly zero.
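
To get a feel for this kind of sparsity, one can sample from a Dirichlet with small parameters and count the "large" coordinates. The sketch below is only an illustration: it uses the standard Gamma-normalization construction of a Dirichlet draw, and the choices $K = 20$, $c = 1.5$, $C_i = 1$ and the threshold in the usage note are our assumptions, not values from the paper.

```python
import random

def sample_sparse_dirichlet(K, c=1.5, C=1.0, seed=0):
    """Draw from Dir(alpha, ..., alpha) with alpha = C / K**c by normalising
    independent Gamma(alpha, 1) variables."""
    rng = random.Random(seed)
    alpha = C / K ** c
    while True:
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(K)]
        total = sum(draws)
        if total > 0.0:   # guard: tiny alpha can make every draw underflow to 0.0
            return [x / total for x in draws]

def count_large(gamma, threshold):
    # Number of coordinates at least as large as the threshold.
    return sum(1 for g in gamma if g >= threshold)
```

Calling `count_large(sample_sparse_dirichlet(20), 0.5 / 20)` typically returns a very small number: most of the mass concentrates on one or two coordinates, while the remaining ones are tiny but, unlike in our assumptions, not exactly zero.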

D.2 Weak topic correlations 

We will prove that the Dirichlet distribution satisfies something akin to the weak topic correlations property.
We prove that when conditioning on some small (o(K)) set of topics being small, the marginal distributions
for the rest of the topic proportions are very close to the original ones. This implies our "weak topic
correlations" property.

The following is true:

Lemma 39. Let $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_K)$ be distributed as specified above.

Let S be a set of topics of size o(K), and let's denote by $\gamma_S$ the vector of variables corresponding to the
topics in the set S, and by $\gamma_{\bar S}$ the rest of the coordinates. Furthermore, let's denote by $\tilde\gamma_{\bar S}$ the distribution of
$\gamma_{\bar S}$ conditioned on all the coordinates of $\gamma_S$ being at most $1/K^{c_1}$ for $c_1 > 1$.

Then, for any $i \in \bar S$ and $\gamma = 1 - \delta$, any $\delta = \Omega(1)$,

$$\mathbb{P}_{\tilde\gamma_{\bar S}}(\gamma_i = \gamma) = (1 \pm o(1))\,\mathbb{P}_{\gamma_{\bar S}}(\gamma_i = \gamma).$$

Proof. It's a folklore fact that if $Y = \mathrm{Dir}(\alpha)$, then

$$(Y_1, Y_2, \ldots, Y_{i-1}, Y_{i+1}, \ldots, Y_K \mid Y_i = y_i) = (1 - y_i)\,\mathrm{Dir}(\alpha_1, \alpha_2, \ldots, \alpha_{i-1}, \alpha_{i+1}, \ldots, \alpha_K)$$

Applying this inductively, we get that 7 s = (I — Sjgs 7j)Dir(as)' Let’s denote s := 
s = Then, since 7 ^ < ior i £ S, s = o(I). Similarly, s = o(l). 

For notational convenience, let’s call cio = oo = cm = do + s. 

The marginal distribution of variable $Y_i$ where $Y = \mathrm{Dir}(\alpha)$ is $\mathrm{Beta}(\alpha_i, \alpha_0 - \alpha_i)$.

Hence, 


and 


P 7 S (7i = 7 ) 


B[ai,ao + s - ai) 


Py® (7* = 7 ) = 


I 


Oti-l 


B{ai, ao — ai) 


— Oi I — s I — s 


C^O —Qli — 1 


The following holds: 




(I-s)“-i 


/ (I-,)(1-7) X-' 

V I - s - 7 J 


( 1 - 7 )' = 
1 +


1 — s — 7 


(i-7r 


Now, I claim the above expression is 1 ± o(l). 

We’ll just prove this for each of the terms individually. Since 1 + > 1 and — 1 — Oj < —1, it 

follows that (1 + < 1. On the other hand, by Bernoulli’s inequality, (1 + > 

1 — (ofi + 1)73737 > 1 — 0 ( 1 ), since 7 = 1 — d, for some constant 6 , by our assumptions. 

For the second term, since 1 — 7 < 1 and s > 0, (1 — 7 )® < 1. On the other hand, again by Bernoulli’s 

inequality, (1 — 7 )® > 1 — 7 s = 1 — o(l), as we needed. 

Comparing B{ai, do + s — ai) and B{ai, dg — ai) is not so much more difficult. By definition, B{ai, oq ~ 
ai) = fg dx, so 

B{cXi, oo + s — ai) _ 

B{ai,ao - ai) 

fg — 2 ;)“°+®““*“^ dx 

fg (lx 

We’ll just bound each of the ratios 

- a;)“o+5-ai-i 
- a;)“o-ai-i 

Namely, this is just (1 — a;)®. Same as above, 1 — o(l) < (1 — 7 )® < 1. Hence, these are within a constant 
from each other. 

□ 
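
The proof above relies on the fact that a single coordinate of a Dirichlet vector has a Beta marginal. A quick Monte Carlo check of that fact is sketched below, comparing the empirical mean and variance of one coordinate with the moments of $\mathrm{Beta}(\alpha_i, \alpha_0 - \alpha_i)$, where $\alpha_0$ is taken to be the sum of all parameters. For numerical convenience the sketch uses moderate parameters (2, 3, 5) rather than the tiny $\alpha_i = C_i/K^c$ of this section. All names are ours.

```python
import random

def sample_dirichlet(alphas, rng):
    # Standard construction: normalise independent Gamma(alpha_i, 1) draws.
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [x / total for x in draws]

def empirical_moments(alphas, index, n_samples=20000, seed=0):
    rng = random.Random(seed)
    xs = [sample_dirichlet(alphas, rng)[index] for _ in range(n_samples)]
    mean = sum(xs) / n_samples
    var = sum((x - mean) ** 2 for x in xs) / n_samples
    return mean, var

def beta_moments(a, b):
    # Mean and variance of a Beta(a, b) random variable.
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, var
```

With parameters (2, 3, 5), the first coordinate should behave like Beta(2, 8), whose mean is 0.2 and whose variance is about 0.0145. The empirical moments land close to these values.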


D.3 Dominant topic equidistribution 

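Now, we pass to proving a smooth version of the dominant topic equidistribution property. Namely, for a
threshold $x_0 = o(1)$, we can consider a topic "large" whenever it's bigger than $x_0$. We will show that for any
topics $Y_i, Y_j$, the probabilities that $Y_i > x_0$ and $Y_j > x_0$ are within a constant from each other.
Mathematically formalizing the above statement, we will prove the following lemma:

Lemma 40. Let $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_K)$ be distributed as specified above. Then, $\frac{\mathbb{P}(\gamma_i > x_0)}{\mathbb{P}(\gamma_j > x_0)} = O(1)$, for any i, j,
if $x_0 = o(1)$.


Proof. As before, the marginal distribution of $Y_i$ is $\mathrm{Beta}(\alpha_i, \alpha_0 - \alpha_i)$. The Beta distribution pdf is just

$$\mathbb{P}(Y_i = x) = \frac{x^{\alpha_i - 1}(1 - x)^{\alpha_0 - \alpha_i - 1}}{B(\alpha_i, \alpha_0 - \alpha_i)}$$

where $B(\alpha_i, \alpha_0 - \alpha_i) = \int_0^1 x^{\alpha_i - 1}(1 - x)^{\alpha_0 - \alpha_i - 1}\,dx$.


Hence, the ratio we care about can be written as

$$\frac{\left(\int_{x_0}^1 x^{\alpha_i - 1}(1 - x)^{\alpha_0 - \alpha_i - 1}\,dx\right)/B(\alpha_i, \alpha_0 - \alpha_i)}{\left(\int_{x_0}^1 x^{\alpha_j - 1}(1 - x)^{\alpha_0 - \alpha_j - 1}\,dx\right)/B(\alpha_j, \alpha_0 - \alpha_j)}$$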


To get a bound on this ratio, it’s sufficient to bound the normalization constants B{ai,ao — ai) and 


B{aj,ao — aj), as well as the ratio 7 ? 




■ dx 




^( 1 —a; 
Jxn ' 


dx 


. Let’s prove first that B(ai, ao~cuz) — ao~ 


By definition, $B(\alpha_i, \alpha_0 - \alpha_i) = \int_0^1 x^{\alpha_i - 1}(1 - x)^{\alpha_0 - \alpha_i - 1}\,dx$. The way we'll analyze this quantity is that
we'll divide the integral in two parts, one from 0 to 1/2 and one from 1/2 to 1.

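Since $\alpha_0 = O(1)$, it follows that $\alpha_0 - \alpha_i - 1 \ge -1$ and $\alpha_0 - \alpha_i - 1 \le 1$. Hence, $(1 - x)^{\alpha_0 - \alpha_i - 1} = \Theta(1)$ on this range.
It follows that

$$\int_0^{1/2} x^{\alpha_i - 1}(1 - x)^{\alpha_0 - \alpha_i - 1}\,dx \sim \int_0^{1/2} x^{\alpha_i - 1}\,dx = \frac{(1/2)^{\alpha_i}}{\alpha_i} \sim \frac{1}{\alpha_i}$$

where the last equality follows since $\frac{1}{2} \le (1/2)^{\alpha_i} \le 1$.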


The second portion is not much more difficult. Since 5 < 5 '^’ < 1, it follows 


_ ( 1 / 2 )““-“* _ 1 
OCQ — CKi Qfo 

where the last two equalities come about since — 1 ^ ao — < 1 . 

But the above two estimates proved that for any i, B{ai,ao — cti) ~ as we needed. 
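
The two estimates above can be checked numerically. The sketch below assumes the intended conclusion is that $B(a, b)$ is within a constant factor of $1/a + 1/b$ when both arguments are at most 1, with $a = \alpha_i$ and $b = \alpha_0 - \alpha_i$. The exact closed form in the source is garbled, so this reading is ours. It uses the identity $B(a, b) = \Gamma(a)\Gamma(b)/\Gamma(a + b)$ through `math.lgamma`.

```python
import math

def beta_fn(a, b):
    # B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b), computed in log-space.
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))

def beta_estimate(a, b):
    # Sum of the two half-interval estimates: about 1/a from [0, 1/2]
    # and about 1/b from [1/2, 1].
    return 1.0 / a + 1.0 / b

def estimate_ratio(a, b):
    return beta_fn(a, b) / beta_estimate(a, b)
```

For small parameters the ratio stays close to 1, for instance `estimate_ratio(0.01, 0.5)` is roughly 0.99, which matches the claim that the normalization constants are determined, up to constants, by the parameters themselves.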

So, we proceed onto bounding 

Ixo ~ x)““-“*-^ dx 

fxo ~ a;)“o-aj-l dx 

We’ll proceed in a similar fashion as before. We’ll pick some point xt, and if x < xt, we will show that 
x“-’-^(l — x)““-“-’-^ is within a constant factor from x“*-^(l — x)““-“*-^. On the other hand, we will show 
that part of the integral where x > xt is dominated by the part where x < xt, which will imply the claim 
we need. 

Let’s rewrite the ratio above a little: 

x“-*-i(l - x)““-“-*-^ _ 

x“*-l(l - 2;)ao-ai-l “ 




' 1 — X 


Proceeding as outlined, I claim that for sufficiently large constants Ci,C 2 , s.t. if x < 1- 1 

1+Cie“i ^ 

then ^ = Oil). Let’s call xt = 1-^— 1 —• 

The claim is then, that if xp > x > xq, that (aj — ai) In(Y^) = 0(1). 

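First let's assume $\alpha_j-\alpha_i > 0$.

Then, if $\ln\left(\frac{x}{1-x}\right) < 0 \Leftrightarrow x < \frac{1}{2}$, the condition is of course satisfied. So let's assume $x \ge \frac{1}{2}$. When $\frac{1}{2} \le x \le x_T$, we get that $\frac{x}{1-x} \le C_1e^{1/\alpha_j}$. Hence, $\ln\left(\frac{x}{1-x}\right) \le \ln C_1 + \frac{1}{\alpha_j}$. It follows that if $C_1, C_2$ are sufficiently large,

$$\left(\frac{x}{1-x}\right)^{\alpha_j-\alpha_i} \le C_1^{\alpha_j-\alpha_i}\,e^{(\alpha_j-\alpha_i)/\alpha_j} = O(1).$$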


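On the other hand, if $\alpha_j-\alpha_i < 0$, when $x > \frac{1}{2}$, $(\alpha_j-\alpha_i)\ln\left(\frac{x}{1-x}\right) < 0$, so we are fine. However, since $|\alpha_j-\alpha_i| < \alpha_i$, it's easy to check that when $\frac{1}{2} > x > x_0$, $(\alpha_j-\alpha_i)\ln\left(\frac{x}{1-x}\right) = O(1)$.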

Finally, we want to claim that the portion of the integral from $x_T$ to 1 is dominated by the portion from $x_0$ to $x_T$.

We can show that the portion from $x_T$ to 1 is $e^{-\Omega(K)}$, and the portion from $x_0$ to $x_T$ is $\Omega(1)$.

Let's lower bound the portion from $x_0$ to $x_T$. We lower bound $\int_{x_0}^{x_T} x^{\alpha_i-1}(1-x)^{\alpha_0-\alpha_i-1}\,dx$ by $x_T^{\alpha_i-1}\int_{x_0}^{x_T}(1-x)^{\alpha_0-\alpha_i-1}\,dx$. For the first factor in the above expression, we use Bernoulli's inequality to prove it's $\Omega(1)$. For the second, the integral will evaluate to

$$\frac{(1-x_0)^{\alpha_0-\alpha_i} - (1-x_T)^{\alpha_0-\alpha_i}}{\alpha_0-\alpha_i}.$$

Let's lower bound the first term in the numerator. If $\alpha_0-\alpha_i \ge 1$, another application of Bernoulli's inequality gives $(1-x_0)^{\alpha_0-\alpha_i} \ge 1-(\alpha_0-\alpha_i)x_0 \ge 1-o(1)$. If, on the other hand, $0 < \alpha_0-\alpha_i < 1$, then $(1-x_0)^{\alpha_0-\alpha_i} \ge 1-x_0 \ge 1-o(1)$.

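Then, I claim that $(1-x_T)^{\alpha_0-\alpha_i} = e^{-\Omega(K)}$. Indeed, for some constant $C_3$,

$$(1-x_T)^{\alpha_0-\alpha_i} = \left(\frac{1}{1+C_1e^{1/\alpha_j}}\right)^{\alpha_0-\alpha_i} \le \left(\frac{1}{C_3e^{1/\alpha_j}}\right)^{\alpha_0-\alpha_i} = e^{-\ln(C_3e^{1/\alpha_j})(\alpha_0-\alpha_i)}.$$

However, since $\alpha_0 = \Omega(K\alpha_j)$ and $\alpha_0-\alpha_i = \Omega(\alpha_0)$, the above expression is upper bounded by $e^{-\Omega(K)}$, which is what we were claiming. Hence, $\int_{x_0}^{x_T} x^{\alpha_i-1}(1-x)^{\alpha_0-\alpha_i-1}\,dx = \Omega(1)$.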

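Let's upper bound the portion from $x_T$ to 1. This expression is upper bounded by

$$x_T^{\alpha_i-1}\int_{x_T}^{1}(1-x)^{\alpha_0-\alpha_i-1}\,dx = \frac{x_T^{\alpha_i-1}(1-x_T)^{\alpha_0-\alpha_i}}{\alpha_0-\alpha_i}.$$

Now, we will separately bound each of $x_T^{\alpha_i-1}$ and $(1-x_T)^{\alpha_0-\alpha_i}$.

The first term can be written as $\frac{1}{x_T^{1-\alpha_i}}$. Now, since $1-\alpha_i > 0$, we can use Bernoulli's inequality to lower bound $x_T^{1-\alpha_i}$ by $1 - \frac{1}{1+C_1e^{1/\alpha_j}}(1-\alpha_i)$. Since $\frac{1}{1+C_1e^{1/\alpha_j}} = O(1/e^{1/\alpha_j})$, and $\frac{1}{1+C_1e^{1/\alpha_j}} < 1/2$, let's say, $1 - \frac{1}{1+C_1e^{1/\alpha_j}}(1-\alpha_i) = \Omega(1)$, i.e. $\frac{1}{x_T^{1-\alpha_i}} = O(1)$.

For the second term, we already proved above that $(1-x_T)^{\alpha_0-\alpha_i} = e^{-\Omega(K)}$. This implies that $\int_{x_T}^{1} x^{\alpha_i-1}(1-x)^{\alpha_0-\alpha_i-1}\,dx = e^{-\Omega(K)}$, which finishes the proof.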

□ 








D.4 Independent topic inclusion

Finally, there's a very simple proxy for "independent topic inclusion". Again, as above, $\gamma_S = (1 - \sum_{i \notin S}\gamma_i)\,\mathrm{Dir}(\alpha_S)$.

But, if we consider "inclusion" to be the probability that a given topic is "noticeable" (i.e. above some threshold), we can use the above Lemma 40 to show that the probability that any topic is "large" (but still $o(1)$) is within a constant factor for all the topics in $S$.

E On common words 

In this section, we show how one would modify the proofs from the previous section to handle common words
as well. We stress that common words are easy to handle if one were allowed to filter them out, but we want
to analyze under which conditions the variational inference updates could handle them on their own.

The difference in contrast to the previous sections is that it's not clear how to argue progress for the common
words: common words do not have lone documents. However, if we can't argue progress for the common
words, then we can't argue progress for the $\gamma$ variables, so the entire argument seems to fail.

Formally, we consider the following scenario: 

• On top of the assumptions we have either in Case Study 1 or Case Study 2, we assume that there are
words which show up in all topics, but their probabilities are within a constant factor $\kappa$ of each other,
$B \ge \kappa \ge 2$. We will call these common words. (The $\kappa \ge 2$ is without loss of generality. If the claim
holds for a smaller $\kappa$, then it certainly holds for $\kappa = 2$. The only difference is that the estimates to
follow could be strengthened, but we assume $\kappa \ge 2$ to get cleaner bounds.)

• For each topic $i$, if $C$ is the set of common words, $\sum_{j \in C}\beta^*_{i,j} \le \frac{1}{\kappa^{100}}$, i.e. there isn't too much mass on
these words.

• Conditioned on topic $i$ being dominant, there is a probability of $1 - \frac{1}{\kappa^{100}}$ that the proportion of topic
$i$ is at least $1 - \frac{1}{\kappa^{100}}$.

Then, recall the theorem we want to prove is: 
Theorem 41. If we additionally have common words satisfying the properties specified above, after $O(\log(1/\epsilon') + \log N)$
KL-tEM updates in Case Study 2, or any of the tEM variants in Case Study 1, and we use the same
initializations as before, we recover the topic-word matrix and topic proportions to multiplicative accuracy
$1+\epsilon'$, if $1+\epsilon' \ge \frac{1}{(1-\epsilon)^{7}}$.

Our analysis here is fairly loose, since the result is anyway a little weak (e.g. $1 - \frac{1}{\kappa^{100}}$ is not really the
best value for the proportion of the dominating topic, or the proportion of such documents required). At
any rate, it will be clear from the proofs that the dependency of the dominating topic on $\kappa$ has to be of the
form $1 - \frac{1}{\kappa^{c}}$, so it's not clear one would gain too much from the tightest possible analysis. The reason we
are including this section is to show cases where our proof methods start breaking down.

We will do the proof for Case Study 1 first, after which Case Study 2 will easily follow. 


E.1 Phase I with common words


The outline is the same as before. We prove the lower bounds on the $\gamma$ and $\beta$ variables first. Namely, we
prove:

Lemma 42. Suppose that the supports of $\beta$ and $\gamma$ are correct. Then, $\gamma^{t+1}_{d,i} \ge \frac{1}{2}\gamma^*_{d,i}$.

Proof. Similarly as before, multiplying both sides of B.1 by $\gamma^t_{d,i}$ we get that


7L> 


E 




dj 


Pijld.i > (1 - o(l))(l - > 7;ld, 


where the second inequality follows since a $1 - \frac{1}{\kappa^{100}}$ fraction of the words in topic $i$ is discriminative. □

Lemma 43. Suppose that the supports of the $\gamma$ and $\beta$ variables are correct. Additionally, if $i$ is a large topic
in d, let ^ 7 ^ j < 7 ^ j < 87 ^ j. Then, for a discriminative word j for topic i, > \Pi j- 

Proof. Again, similarly as in Lemma 8 , 


/3*+i > 

^*,.7 — 




l^d=l 'd,i 

In the documents where topic $i$ is the largest, $\gamma^t_{d,i} \le 3\gamma^*_{d,i}$. So, we can conclude


1 J2deD, Td.i 


/3 + > fl - e)P* ■- 


3 Ed=ilh 


Since Ap”' . ’* > (1 — o(l)), as before, we get what we want. 

Ed=l ^dA 


□ 


Lemma 44. Let the $\beta$ variables have the correct support. Let $j$ be a discriminative word for topic $i$, and let
Plj > whenever Pf^ ^ 0, 7 ^ ^ ^ 0. Let pC = C^Plp where C^ > 4(7^, and Cm is a 


constant. Then, in the next iteration, Pl'^^ < where Ci'^^ < -lA. 




Proof. The proof is exactly the same as Lemma 9. 


□ 


Now, we finally get to the upper bound of the $\gamma$ values.
Lemma 45. Fix a particular document $d$. Let's assume the supports for the $\beta$ and $\gamma$ variables are correct.
Furthermore, let $\beta^t_{i,j}/\beta^*_{i,j} \le C_m$ for some constant $C_m$. Then, $\gamma^{t+1}_{d,i} \le 2\gamma^*_{d,i}$.

Proof. Again, multiplying B.1 by $\gamma^t_{d,i}$, we get


'ld,^ = 

HU 


If « = EjeLi since P, .> ^ 


jeLi 

since rd,^ ^ 


j^C ■'d-.j 


If we denote F = '^j^c Pi,j^ 



< (1 + e)Cm 


ld,i < (l + e)(a7E + C'm(l-r-a)7d.*+r«: ld,i) 


Equivalently, 7 *^, < i_(i+,)cg^//_+rE)-(i+.)r .4 7Xz 

Then, we claim that ^ I + ^' Indeed, Fk^ < and (7^(1 - F - 5) < 

Cm{l — a) = 0 ( 1 ). Hence, 


(1 + e)d ^ (1 + e)d ^ (1 + e)d 

1 - (1 + e)C^(l - F - 5) - (1 + e)FK4 “ 1 - o(l) - k-96 - 1 - ^-95 

Finally, we claim that < 1 + Indeed, this is equivalent to 

5 < (1 + e)(l + - k- 95) < (1 + e)(l + k-50) 

But, since we assume $\kappa \ge 2$, the claim we need follows easily. □

E.2 Phase II of analysis 

Finally, we deal with the alternating minimization portion of the argument. How will we deal with the lack
of anchor documents? The almost obvious way: if a document has topic $i$ with proportion $1 - \frac{1}{\kappa^{100}}$, it will
behave for all purposes like an anchor document, because the dynamic range of the word probabilities $\beta^*_{i,j}$ is limited, and the
contribution from the other topics is not that significant.

Intuitively, we’ll show that ~ so that these documents provide a ’’push” for the value of /3| in 
the correct direction. 

Lemma 46. Let’s assume that our current iterates Pj j satisfy ^ < Cp for C^ > ■ Then, 

after iterating the 7 updates to convergence, we will get values 7 ^ ^ that satisfy < {C^yP^. 

Proof. As before, we have that 


7li — X/ ^d,j + ld,i X] 

J&Li j^Li dd,j 


Let’s denote as (7‘ = maxi(max(-^, :^)), and let, as before, assume that = C* 

^ '^d,i '^d,iQ ^ 

By the definition of C*, 

Id, 10 = 51 /‘'d+7lio 55 

dd,j 


HLi, 


HLi. 
We claim that


(l + e)(575,.o + (l-5)(C^)^(C‘)^75..o) 

(1 + e)(a + (1 - a)(C^)2(C;)2) < 


(E.l) 


which will be a contradiction to the definition of C* 


(pt)1/10 


After a little rewriting, E.l translates to a > 1 — assumption on C*, , so 


(^t) 1/10 


-1 


the right hand side above is upper bounded by 1 - (ct) 8 _i 

But, Lemma 45 implies that certainly C* < C°. The function 


/(c) = 


^ 1/10 

1 +e 


- 1 


- 1 


can be easily seen to be monotonically decreasing on the interval of interest, and hence is lower bounded by 
( 011 ) 8-1 ■ Since a = (1 — o(l))(l — and < 3, the claim we want is clearly true. 

The case where = C* is not much more difficult. An analogous calculation as in Lemma 12 gives 




1 -- 


^ 1/10 

that to get a contradiction to the definition of C*, the condition required is that 1 - ^ {' -. As before. 


(C^ 


2 _ 1 

if /(c) = —, it s easy to check that /(c) is monotonically increasing in the interval of interest, so 
lower bounded by 

1 - 


1 - 


1 -)20\8 


((t^) 

l-(l-c) ^ 1 

1- (l-e)i60 - 160 


But, a > (1 — ■^toit)( 1 — o(l)) > 1 — so we get what we want. 


□ 


Next, we show the following lemma. 


Lemma 47. Suppose at time step t, ^ 7li ^ ^ and < Plj < CpPlj, such that C* < 

(C‘)i/io for Cl > Then, at time step t + 1, where Cl+^ = (C‘)3/4 


Proof. Let's assume a document $d$ has a dominating topic of proportion at least $1 - 1/\kappa^{100}$.

f* 

Then, we claim that 4^ > mAiu We will do a sequence of rearrangements to get this condition to 

Jd,j 1^/3/ Ci,j 

a simpler form: 
d,j
* I :3

id,i + id,i —- 


flj ^ 1 /- 

/?*, - (C‘)l/4 PI 
3 _'


Let’s upper bound the right hand side by some simpler quantities. We have: 
(C* )^/4 ^^‘^7 pt ) —
(C^C;(7i. + (Q)=X;7i,^)


Hence, it is sufficient to prove 


/ 3 ?. 




ci 


ci 


JP. 


> E^lAj^Aclf - 1 )^ 

Again, we can upper bound the right hand side by 


Y.Ad,A 


c. 
(C‘)V4


(C'l)^- 1 )« = 




c‘ 




So, it is sufficient to prove: 

(1 - < ihA - 

AA'^ ~ jcAyA ((c't)^i/4 - (((^t)^i/4 


- 1 At O 


1 - 


cf 


(CJF73 


Tt/.i — ^ (^t pt 

It’s easy to check that the expression on the right hand side as a function of C* is decreasing. Hence, the 


RHS is upper bounded by 


1 _ 1 

^ tnt 


1 - 


icW^ 


l-j^ + ni{C},rV^o_i) 


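Now, let's analyze this expression. If we let

$$f(x) = 1 - \frac{1 - \frac{1}{x^{3/20}}}{1 - \frac{1}{x^{3/20}} + \kappa\left(x^{37/20} - 1\right)},$$

I claim $f(x)$ is an increasing function of $x$. Indeed, we can calculate its derivative fairly easily:

$$f'(x) = -\frac{\frac{3}{20}x^{-23/20}\left(1 - \frac{1}{x^{3/20}} + \kappa(x^{37/20}-1)\right) - \left(1 - \frac{1}{x^{3/20}}\right)\left(\frac{3}{20}x^{-23/20} + \frac{37}{20}\kappa x^{17/20}\right)}{\left(1 - \frac{1}{x^{3/20}} + \kappa(x^{37/20}-1)\right)^2}$$

$$= -\frac{\frac{3}{20}\kappa x^{-23/20}\left(x^{37/20}-1\right) - \frac{37}{20}\kappa x^{17/20}\left(1 - x^{-3/20}\right)}{\left(1 - \frac{1}{x^{3/20}} + \kappa(x^{37/20}-1)\right)^2} = \frac{\frac{\kappa}{20}\left(3x^{-23/20} + 37x^{17/20} - 40x^{14/20}\right)}{\left(1 - \frac{1}{x^{3/20}} + \kappa(x^{37/20}-1)\right)^2}$$

By the AM-GM inequality, $3x^{-23/20} + 37x^{17/20} \ge 40\left((x^{17/20})^{37}(x^{-23/20})^{3}\right)^{1/40} = 40x^{14/20}$, so $f'(x)$ is
positive, so the RHS, as a function of $C^t_\beta$, is increasing.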
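The AM-GM step and the resulting monotonicity can be checked numerically. The sketch below assumes that $f$ has the form $f(x) = 1 - \frac{1 - x^{-3/20}}{1 - x^{-3/20} + \kappa(x^{37/20}-1)}$, which is a reading of the garbled source consistent with the exponents in the AM-GM line; the function names and the value $\kappa = 2$ are illustrative choices, not the paper's.

```python
def amgm_gap(x: float) -> float:
    """3x^(-23/20) + 37x^(17/20) - 40x^(14/20); weighted AM-GM says this is >= 0 for x > 0."""
    return 3 * x ** (-23 / 20) + 37 * x ** (17 / 20) - 40 * x ** (14 / 20)


def f(x: float, kappa: float) -> float:
    """Assumed form of f for x > 1 (a reconstruction, see the lead-in)."""
    u = 1 - x ** (-3 / 20)
    return 1 - u / (u + kappa * (x ** (37 / 20) - 1))


if __name__ == "__main__":
    kappa = 2.0  # hypothetical value, matching the assumption kappa >= 2
    xs = [1.01 + 0.5 * k for k in range(20)]
    for x in xs:
        print(f"x={x:.2f}  gap={amgm_gap(x):.6f}  f={f(x, kappa):.6f}")
```

The weights 3 and 37 sum to 40, and the exponent check $(3 \cdot (-23) + 37 \cdot 17)/(20 \cdot 40) = 14/20$ is what makes the geometric mean come out to $x^{14/20}$.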

So, it is sufficient to satisfy the inequality when $C^t_\beta = C^0_\beta$. One can check, however, that by Lemmas 43
and 44 this is true.

Proceeding to the lower bound, a similar calculation as before gives that the necessary condition for 
progress is: 
(CbY/^
(C*)l/4


Ci ' kI Ci, Jew 


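Again, the right hand side expression is decreasing in $C^t_\gamma$, so it is certainly upper bounded by

$$1 - \frac{1 - (C^t_\beta)^{3/20}}{1 - (C^t_\beta)^{3/20} + \kappa\left(\frac{1}{(C^t_\beta)^{37/20}} - 1\right)}.$$

Now, the claim is that this expression is increasing in $C^t_\beta$. Again, denoting $f(x) = 1 - \frac{1 - x^{3/20}}{1 - x^{3/20} + \kappa(x^{-37/20} - 1)}$,

$$f'(x) = -\frac{-\frac{3}{20}x^{-17/20}\left(1 - x^{3/20} + \kappa(x^{-37/20}-1)\right) - \left(1 - x^{3/20}\right)\left(-\frac{3}{20}x^{-17/20} - \frac{37}{20}\kappa x^{-57/20}\right)}{\left(1 - x^{3/20} + \kappa(x^{-37/20}-1)\right)^2}$$

$$= -\frac{-\frac{3}{20}\kappa x^{-17/20}\left(x^{-37/20}-1\right) + \left(1 - x^{3/20}\right)\frac{37}{20}\kappa x^{-57/20}}{\left(1 - x^{3/20} + \kappa(x^{-37/20}-1)\right)^2} = -\frac{\frac{\kappa}{20}\left(-40x^{-54/20} + \left(3x^{-17/20} + 37x^{-57/20}\right)\right)}{\left(1 - x^{3/20} + \kappa(x^{-37/20}-1)\right)^2}$$

By the AM-GM inequality, $3x^{-17/20} + 37x^{-57/20} \ge 40x^{-54/20}$, so $f'(x)$
is negative, so the RHS, as a function of $C^t_\beta$, is decreasing. So it suffices to check the inequality when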
= (1 — e)^°. In this case, we want to check that 


1 - 


1 - 


> 1 - 


(T=7p 




q;„pp 1_ (1-0^ _ <1_ 

- 37+3k’ 


3 k 


and k>2, this is easily seen to be true. 


Now, we’ll split the j3 update into two parts: documents where topic i is at least 1 — 1 /k^™, and the rest 
of them. In the first group, as we showed above, > m^i/n ■ In the second group, we can certainly claim 

Jd,j 

f* 

that 7 ^ > ttW from the inductive hypothesis. If we denote the set of documents where topic i is at least 

Jd,J ’-'y'-'fi 

1 — I/ft:^°° as £>i, we get that 
2^d f* . 'd.i
Z^i=l 'd,i
If we denote /i = 3'^,.
So, to prove it’s sufficient to show
Given that $C^t_\gamma \le (C^t_\beta)^{1/10}$, it's sufficient to show

1 1 

(C|)T72 (C*)23/10 

t^> —^^— 

(C*)9/20 (C*)23/10 


1 1 

1 1 

(C*)9/20 (C‘)23/10 


^9 /20 ^ 1/2 

Completely analogously as before, 1 — 

(C*)3/30 (C|)23/10 

1 _ 1 

(f~'t \^ /20 \ 1 / 2 

to check that // > 1 -t-t -when 

(C‘)3/20 (C*)33/10 ^ ^ 

In the same way, one can prove that 


is a decreasing function of 
, which is easily checked to 


C"!, so it’s sufficient 
be true. 

□ 


Putting Lemmas 46 and 47 together, we get the analogue of Lemma 14:

Lemma 48. Suppose it holds that < ^Y* < C*, C* > ■ Then, after one KL minimization step 

with respect to the 7 variables and one f3 iteration, we get new values that satisfy ohfT < < (7*+^, 

where $C^{t+1} = (C^t)^{3/4}$.


As a corollary:

Corollary 49. Phase III requires $O(\log(1/\epsilon'))$ iterations to estimate each of the topic-word
matrix and document proportion entries to within a multiplicative factor of $1+\epsilon'$.

This finishes the proof of Theorem 41 for Case Study 1.
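The iteration count follows from the recursion $C^{t+1} = (C^t)^{3/4}$: taking logarithms, $\ln C^t = (3/4)^t \ln C^0$, so reaching $C^t \le 1+\epsilon'$ needs about $\log_{4/3}\left(\ln C^0 / \ln(1+\epsilon')\right)$ steps. A minimal sketch of this calculation follows; the function names and the starting value $C^0 = 2$ are assumptions made for illustration only.

```python
import math


def iterations_to_accuracy(c0: float, eps: float) -> int:
    """Count steps of c <- c**(3/4) until c <= 1 + eps."""
    c, steps = c0, 0
    while c > 1.0 + eps:
        c = c ** 0.75
        steps += 1
    return steps


def predicted_iterations(c0: float, eps: float) -> int:
    """Closed-form count: smallest t with (3/4)^t * ln(c0) <= ln(1 + eps)."""
    if c0 <= 1.0 + eps:
        return 0
    ratio = math.log(c0) / math.log1p(eps)
    return math.ceil(math.log(ratio) / math.log(4.0 / 3.0))


if __name__ == "__main__":
    for eps in (1e-1, 1e-2, 1e-3, 1e-4):
        print(eps, iterations_to_accuracy(2.0, eps), predicted_iterations(2.0, eps))
```

Each tenfold reduction in the target accuracy adds a constant number of steps, which is the $O(\log(1/\epsilon'))$ behaviour stated in the corollary.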


E.3 Generalizing Case Study 2 

Finally, the proof for Case Study 2 is quite simple. Because the dynamic range $\kappa \le B$ for the common words,
Lemmas 35 and 36 still hold, and hence we again determine the dominant topic correctly. Because of this,
it's also easy to see that the lower bounds and upper bounds on the $\beta^t_{i,j}$ values for the common words are
maintained to be a constant, since the proof of Lemmas 32 and 33 holds for the common words verbatim.
This means that the anchor words and discriminative words will be correctly determined just as before. But
after that point, the analysis of Case Study 2 is exactly the same as the one for Case Study 1, which we
already covered in the above section. This finishes the proof of Theorem 41.


F Estimates on number of documents 

Finally, we state a few helper lemmas to estimate how many documents will be needed. The properties we 
need are that the empirical marginals of a dominating topic in the documents where it’s dominating are 
close to the actual ones, and similarly that the empirical marginals of the dominating topic, conditioned on 
the set of topics that a discriminative word belongs to not being present are close to the actual ones. 

The former statement is the following: 

Lemma 50. Let Ei = E[ 7 ^ ^ is dominating]. If the total number of documents is D = ^ ), and 

$D_i$ is the number of documents where $i$ is the dominant topic, then with high probability, for all topics $i$,

$$(1-\epsilon)E_i \le \frac{1}{D_i}\sum_{d \in D_i}\gamma^*_{d,i} \le (1+\epsilon)E_i.$$


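Proof. Since documents are generated independently, $\Pr\left[\frac{1}{D_i}\sum_{d \in D_i}\gamma^*_{d,i} > (1+\epsilon)E_i\right] \le e^{-\epsilon^2 D_i E_i/3}$ by Chernoff.

Since there are at most $T$ topics per document, $E_i \ge \frac{1}{T}$, so $\Pr\left[\frac{1}{D_i}\sum_{d \in D_i}\gamma^*_{d,i} > (1+\epsilon)E_i\right] \le e^{-\epsilon^2 D_i/(3T)}$.

An analogous statement holds for $\Pr\left[\frac{1}{D_i}\sum_{d \in D_i}\gamma^*_{d,i} \le (1-\epsilon)E_i\right]$.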
lo ^

Then, if Di = > by union bounding, we get that with high probability, for all topics, (1 — e)Ei < 

Wi '^ddDi ld,i < (1 + 

However, the probability of a topic being dominating is $C_i/K$ for some constant $C_i$. So, by another
Chernoff bound,

Pr[A < (1 - e)C,D/K] < (F.l) 

So, if we take D = ^ log^ K, with high probability, for all topics, Di = Q{D/K). 

Putting everything together, we get that if D = ^ ^ , with high probability, 

$$(1-\epsilon)E_i \le \frac{1}{D_i}\sum_{d \in D_i}\gamma^*_{d,i} \le (1+\epsilon)E_i.$$
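To see how a Chernoff-plus-union-bound argument of this kind translates into a document count, the sketch below uses the standard multiplicative Chernoff bound $\Pr[X \ge (1+\epsilon)\mu] \le e^{-\epsilon^2\mu/3}$ for sums of independent $[0,1]$-valued variables with $0 < \epsilon \le 1$. The constants, function names, and example values are assumptions for illustration and are not the paper's exact bounds.

```python
import math


def chernoff_upper_tail(mu: float, eps: float) -> float:
    """Standard bound exp(-eps^2 * mu / 3) on Pr[X >= (1 + eps) * mu], for 0 < eps <= 1."""
    return math.exp(-eps * eps * mu / 3.0)


def docs_needed(eps: float, mean_lower_bound: float, num_events: int, fail_prob: float) -> int:
    """Smallest n with num_events * exp(-eps^2 * n * mean_lower_bound / 3) <= fail_prob.

    mean_lower_bound plays the role of a lower bound on the per-document mean
    (for example 1/T), and num_events is the number of events in the union bound.
    """
    return math.ceil(3.0 * math.log(num_events / fail_prob) / (eps * eps * mean_lower_bound))


if __name__ == "__main__":
    # Hypothetical numbers: 50 topics, T = 5 topics per document, eps = 0.1.
    n = docs_needed(eps=0.1, mean_lower_bound=1.0 / 5, num_events=50, fail_prob=0.01)
    print(n, 50 * chernoff_upper_tail(n / 5, 0.1))
```

The required count scales like $\frac{T}{\epsilon^2}\log(\text{number of events})$ per topic, and the second Chernoff bound on $D_i$ then converts this into a requirement on the total number of documents $D$.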


Next, we calculate how many documents are needed to match the marginals of the dominating topics, 
conditioned on a small subset (of size o{K)) of the topics not being included in a document. More formally. 

Lemma 51. For the discriminative word $j$, let $jS$ be the set of topics it belongs to. For a topic $i \in jS$,
let $E_{i,jS} = \mathbb{E}[\gamma^*_{d,i} \mid \gamma^*_{d,i} \text{ dominating}, \gamma^*_{d,i'} = 0, \forall i' \in jS]$. Let $D_{i,jS}$ be the number of documents where $i$ is
dominating, and $\gamma^*_{d,i'} = 0, \forall i' \in jS$.

If the number of documents D > ^ , then with high probability, for all topics i and discriminative 

words j, (1 - e)E,^js < 'Edeo.js ^ 

Proof. Since Eijs = (1 i o{l))Ei, by the weak topic correlation property, an analogous proof as above shows 

that if we get that if A.jS = with high probability, (1 - e)E,s < EdeUis 

But by the independent topic inclusion property, the probability of generating a document with $i$ being
the dominating topic, s.t. no topics in $jS$ appear in it, is $\Omega(1/K)$. So, again by Chernoff,

Pr[A,,s < (1 - t)C,D/K] < e-T (F.2) 

If we take D = ^ log^ N, PilDijs < (1 ~ e)CiD/K] < e~ ^. However, since the total number of i,jS 
pairs is at most N'^, union bounding, we get that with high probability, for all pairs i,jS, 

(1 - e)Eijs < Y) - X/ < (1 + 


□ 

Finally, the following short lemma to estimate the number of documents in which a word j belongs only 

to the dominating topic is implicit in the proof above: 

Lemma 52. Let Di jg be the number of documents where i is dominating, and 7 ^ j, = 0,Vi' € js- If the 

number of documents D > ^ , then with high probability, for all topics i and discriminative words j, 

A,,s> A(l-e)(l-o(l)) 